Re: [HACKERS] Bug in ExecModifyTable function and trigger issues for foreign tables
On 27 November 2017 at 16:35, Tom Lane <t...@sss.pgh.pa.us> wrote:
> I wrote:
>> Dean Rasheed <dean.a.rash...@gmail.com> writes:
>>> A separate point -- it might be marginally more efficient to have the work of rewriteTargetListUD() done after expand_targetlist() to avoid the possible renumbering of the resjunk entries.
>>
>> Hm. It wouldn't save a lot, but yeah, doing it in this order seems a bit silly when you put it like that.
>
> On looking closer, the reason it's like that in Fujita-san's patch is to minimize the API churn seen by FDW AddForeignUpdateTargets functions, specifically whether they see a tlist that's before or after what expand_targetlist() does. I'm doubtful that the potential savings is worth taking risks there. In particular, it seems like a good thing that expand_targetlist() verifies the correct tlist ordering *after* the FDW function has acted. So now my inclination is to leave this alone.

Ah yes, that seems like a worthwhile check to keep. Never mind then.

Regards,
Dean
Re: NaNs in numeric_power (was Re: Postgres 11 release notes)
On 15 May 2018 at 22:55, Tom Lane wrote:
> David Rowley writes:
>> On 16 May 2018 at 02:01, Tom Lane wrote:
>>> I'm not particularly fussed about getting credit for that. However, looking again at how that patch series turned out --- ie, that we ensured POSIX behavior for NaNs only in HEAD --- I wonder whether we shouldn't do what was mentioned in the commit log for 6bdf1303, and teach numeric_pow() about these same special cases. It seems like it would be more consistent to change both functions for v11, rather than letting that other shoe drop in some future major release.
>>
>> I'm inclined to agree. It's hard to imagine these two functions behaving differently in regards to NaN input being useful to anyone.
>
> Here's a proposed patch for that.

In the case 1 ^ NaN = 1, what should the result scale be? For other inputs, the result scale is at least as large as the scale of the first input, so I would suggest that the same ought to be the case here.

Otherwise, this looks fine, and I agree that it makes things neater / more consistent.

Regards,
Dean
Re: NaNs in numeric_power (was Re: Postgres 11 release notes)
On 16 May 2018 at 14:44, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Dean Rasheed <dean.a.rash...@gmail.com> writes:
>> In the case 1 ^ NaN = 1, what should the result scale be?
>
> The result is exact, so I don't see a reason to be worried about its scale. Messing with the scale would also require expending even more code on what is, in the end, a corner case that no users have expressed concern about.

OK, fine. Not really worth worrying about.

Regards,
Dean
Re: MCV lists for highly skewed distributions
On 21 January 2018 at 07:26, John Naylor <jcnay...@gmail.com> wrote:
> I spent a few hours hacking on this, and it turns out calculating the right number of MCVs taking into account both uniform and highly non-uniform distributions is too delicate a problem for me to solve right now. The logic suggested by Dean Rasheed in [1] always produces no MCVs for a perfectly uniform distribution (which is good), but very often also for other distributions, which is not good. My efforts to tweak that didn't work, so I didn't get as far as adapting it for the problem Jeff is trying to solve.

Hmm, Tom suggested that the test based on the average frequency over all values might be too strict because the estimated number of distinct values is often too low, so that might explain what you're seeing.

It occurs to me that maybe a better test to exclude a value from the MCV list would be to demand that its relative standard error not be too high. Such a test, in addition to the existing tests, might be sufficient to solve the opposite problem of too many values in the MCV list, because the real problem there is including a value after having seen relatively few occurrences of it in the sample, and thus having a wildly inaccurate estimate for it. Setting a bound on the relative standard error would mean that we could have a reasonable degree of confidence in estimates produced from the sample.

Regards,
Dean
Re: stricter MCV tests for uniform distributions (was Re: MCV lists for highly skewed distributions)
On 22 January 2018 at 08:07, John Naylor <jcnay...@gmail.com> wrote:
> On 1/21/18, Dean Rasheed <dean.a.rash...@gmail.com> wrote:
>> It occurs to me that maybe a better test to exclude a value from the MCV list would be to demand that its relative standard error not be too high. Such a test, in addition to the existing tests, might be sufficient to solve the opposite problem of too many values in the MCV list, because the real problem there is including a value after having seen relatively few occurrences of it in the sample, and thus having a wildly inaccurate estimate for it. Setting a bound on the relative standard error would mean that we could have a reasonable degree of confidence in estimates produced from the sample.
>
> If you don't mind, what would the math look like for that?

Using the same syntax as before:

    N = Total rows in table (population size)
    n = Number of rows sampled
    x = Number of times a particular value appears in the sample
    p = x/n = Frequency of the value in the sample

the standard error of the proportion is

    SE = sqrt(p*(1-p)/n) * sqrt((N-n)/(N-1))

Then the relative standard error (which is usually expressed as a percentage) is

    RSE = 100 * SE / p

So if we were to demand that the relative standard error was less than, say, 10%, then the constraint would just be

    SE < 0.1 * p

Note:

* This formula is not valid if x is very small (see what I wrote about being able to approximate the distribution of p with a normal distribution). So we should also enforce the "rule of thumb" x >= 10.

* The frequency p that we're storing is the count divided by the overall sample size. So the values for N and n above should not be the counts that exclude NULLs. As far as this logic is concerned, NULLs (and too-wide values) are just values not equal to the value under consideration. Thus it appears, from just a quick glance at your patch, that you're using the wrong values for N and n.
* The RSE is a monotonically decreasing function of p in the range [0,1], so an upper bound on the RSE equates to a lower bound on the number of occurrences of the value.

This last point gives me a different idea. Instead of applying this test *in addition to* the existing tests, how about applying it *instead of* the existing tests? That is, we keep all MCVs that appear sufficiently often that we can be reasonably confident in the estimates produced from their sample frequencies, regardless of whether or not they are more common than average (the average being fairly meaningless for a highly skewed distribution anyway).

This is still keeping the most common values, but now we'd be saying that we keep any value that appears sufficiently often in the sample that we can extrapolate its sample frequency to produce a reasonably accurate estimate of the population frequency, and discard values for which the estimate is likely to be inaccurate.

I have not tested this idea at all, but it seems like it might be worth considering. It has the nice property that the test depends entirely on how often the value appears, rather than on other previously computed statistics, such as Ndistinct.
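To make the formulas above concrete, here is a quick sketch of my own (not PostgreSQL code) that evaluates the SE/RSE expressions and applies the proposed keep/discard test; the 10% threshold and the x >= 10 rule of thumb are the values discussed above:

```python
import math

def relative_standard_error(x, n, N):
    """RSE of the sample proportion p = x/n, with the finite-population
    correction for a sample of n rows from a table of N rows."""
    p = x / n
    se = math.sqrt(p * (1 - p) / n) * math.sqrt((N - n) / (N - 1))
    return se / p  # multiply by 100 for a percentage

def keep_as_mcv(x, n, N, rse_cutoff=0.1):
    # Rule of thumb: the normal approximation behind the RSE test is
    # only reasonable once the value has been seen at least 10 times.
    return x >= 10 and relative_standard_error(x, n, N) < rse_cutoff

# A value seen 15017 times in a 30,000-row sample of a 10-million-row
# table has a tiny relative error (~0.6%), so it clearly qualifies:
print(keep_as_mcv(15017, 30000, 10_000_000))  # True
# One seen only 46 times does not (RSE is about 14.7%):
print(keep_as_mcv(46, 30000, 10_000_000))     # False
```

This matches the KEEP/DISCARD boundary in the SQL experiment below: the cutoff falls between the values seen 108 and 46 times.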
Doing a quick test in pure SQL, using the highly skewed distribution Jeff Janes gave in the other thread, with the default sample size of 30,000:

with samples as (
  select floor(-log(random())/log(2))::int as who
    from generate_series(1, 30000)
), freqs as (
  select who, count(*) as x, count(*)/30000::float8 as p
    from samples group by who
), stats as (
  select *, sqrt(p*(1-p)/30000) * sqrt((10000000-30000)::float8/(10000000-1)) as se
    from freqs
)
select *, (10000000*p)::int||'+/-'||(2*se*10000000)::int as "95% interval",
       case when x >= 10 and se < 0.1*p then 'KEEP' else 'DISCARD' end
  from stats
 order by p desc limit 100;

it pretty consistently keeps the 8 most common values:

 who |   x   |     p      |          se          |  95% interval   |  case
-----+-------+------------+----------------------+-----------------+---------
   0 | 15017 | 0.50056667 |  0.00288241625942075 | 5005667+/-57648 | KEEP
   1 |  7607 | 0.25356667 |  0.00250800597590887 | 2535667+/-50160 | KEEP
   2 |  3713 | 0.12376667 |  0.0018984483        | 1237667+/-37969 | KEEP
   3 |  1855 | 0.06183333 |  0.0013884757600711  |  618333+/-27770 | KEEP
   4 |   914 | 0.03046667 | 0.000990788179299791 |  304667+/-19816 | KEEP
   5 |   448 | 0.01493333 | 0.000699194759916533 |  149333+/-13984 | KEEP
   6 |   229 | 0.00763333 | 0.000501741670388358 |   76333+/-10035 | KEEP
   7 |   108 | 0.0036     | 0.000345267009604061 |   36000+/-6905  | KEEP
   8 |    46 | 0.00153333 | 0.000225565173744715 |   15333+/-4511  | DISCARD
   9 |    34 | 0.00113333 | 0.000193963300230354 |   11333+/-3879  | DISCARD
Re: MCV lists for highly skewed distributions
On 1 February 2018 at 13:16, Simon Riggs wrote:
> On 25 January 2018 at 22:19, Tom Lane wrote:
>> In any case, since it looks like the next step is for someone to come up with a new proposal, I'm going to set this to Waiting on Author.
>
> Dean and John's results show that different algorithms work better for different cases.
>
> How about we make ANALYZE's MCV algorithm pluggable? And then include, say, 2 additional algorithms.

I don't think we've yet proved that that's needed. I'd rather attempt to come up with a decent general-purpose algorithm, if possible.

The main problem I have with the originally proposed patch is the lack of any kind of rigorous justification for the approach of changing the algorithm to include values that are "significantly more common than average frequency for values which will not be represented in the MCV list". So there's no guarantee that the MCVs produced will be backed by sufficient evidence, and it risks making the too-many-MCVs case worse.

Of course the current code suffers from both the too-many-MCVs and too-few-MCVs problems, depending on the data distribution:

For a reasonably uniform distribution with quite a large number of distinct values, it is possible to generate MCVs on the basis of having seen only a few instances of the values. Those few values seen are then not sufficiently statistically significant, and extrapolating to the whole table produces wildly inaccurate estimates.

For a highly skewed distribution, it is possible for there to be hardly any values (maybe only one) that appear more than 1.25 times the average frequency, and so lots of otherwise perfectly good common values get discarded. For such a distribution, I don't think that the average frequency is particularly interesting, and it shouldn't be used to filter the MCV list.
There is also another variant of the too-many-MCVs problem that I believe is also possible -- if the sample contains a large number of NULLs or too-wide values, values_cnt could be reduced to the point where maxmincount is quite small, and again MCVs could get included on the basis of a very small number of occurrences.

I think it would be better to try to come up with an alternative algorithm that has a better theoretical basis, and then test that to see how it holds up in practice.

With that in mind, attached is a patch based on the idea of setting a bound on the relative standard error on the sample frequency -- so we include values in the MCV list if and only if they are seen enough times in the sample that the standard error on the sample frequency is small compared to the sample frequency itself, and thus we expect that the relative error resulting from extrapolating that to the whole table is also small. In theory then, it should never generate too many MCVs, although tuning of the relative error threshold might be required to prevent it from generating too few.

Note, this is not a finished patch (e.g., I haven't touched compute_distinct_stats()). It's merely a possible starting point from which a lot more testing will be required.

Testing it with the example from [1], it does indeed appear to solve the too-many-MCVs problem in that particular case (actually generating no MCVs). Testing it with the highly skewed example at the start of this thread, it typically generates a couple more MCVs, producing somewhat better estimates, but to get really good estimates for who=17, you need to crank up the stats target. It does at least appear to no longer be the case that cranking up the stats target has a weak effect.
Regards,
Dean

[1] https://www.postgresql.org/message-id/flat/20170522132017.29944.48...@wrigleys.postgresql.org

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
new file mode 100644
index 5f21fcb..af92835
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -2557,25 +2557,21 @@ compute_scalar_stats(VacAttrStatsP stats
 	 * Decide how many values are worth storing as most-common values. If
 	 * we are able to generate a complete MCV list (all the values in the
 	 * sample will fit, and we think these are all the ones in the table),
-	 * then do so. Otherwise, store only those values that are
-	 * significantly more common than the (estimated) average. We set the
-	 * threshold rather arbitrarily at 25% more than average, with at
-	 * least 2 instances in the sample. Also, we won't suppress values
-	 * that have a frequency of at least 1/K where K is the intended
-	 * number of histogram bins; such values might otherwise cause us to
-	 * emit duplicate histogram bin boundaries. (We might end up with
-	 * duplicate histogram entries anyway, if the distribution is skewed;
-	 * but we prefer to treat such values as MCVs if at all possible.)
+	 * then do so. Otherwise, keep only those values that appear
+	 * sufficiently often in the sample that it is reasonable to
+	 *
Re: WINDOW RANGE patch versus leakproofness
On 31 January 2018 at 21:51, Robert Haas <robertmh...@gmail.com> wrote:
> On Wed, Jan 31, 2018 at 5:52 AM, Dean Rasheed <dean.a.rash...@gmail.com> wrote:
>> On 30 January 2018 at 16:42, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> So I'm thinking that (a) we do not need to check for leaky functions used in window support, and (b) therefore there's no need to avoid leaky behavior in in_range support functions. Objections?
>>
>> Yes, I concur. Since window functions can only appear in the SELECT target list and ORDER BY clauses, they should never appear in a qual that gets considered for push down, and thus contain_leaked_vars() should never see a window function.
>
> What about a query that uses window functions within a subquery?

A qual containing a subquery is treated as not push down safe, so it wouldn't even be considered for pushing down into a security barrier view. On a table with RLS, contain_leaked_vars() would see a subplan on the restrictinfo's clause and return true, causing the restrictinfo to be treated as leaky. So in each case, the qual with the subquery would be executed after any security quals, regardless of what it contained.

The contents of the subquery aren't currently examined; it just defaults to leaky. If they were to be examined, the window function would be found and it would still default to being leaky.

Regards,
Dean
Re: WINDOW RANGE patch versus leakproofness
On 30 January 2018 at 16:42, Tom Lane wrote:
> So I'm thinking that (a) we do not need to check for leaky functions used in window support, and (b) therefore there's no need to avoid leaky behavior in in_range support functions. Objections?

Yes, I concur. Since window functions can only appear in the SELECT target list and ORDER BY clauses, they should never appear in a qual that gets considered for push down, and thus contain_leaked_vars() should never see a window function.

Moreover, contain_leaked_vars() is intentionally coded defensively, so if it ever does somehow see a window function (or any other unexpected node type) it will return true, and the resulting qual/restrictinfo will be marked leaky and not pushed through security barriers.

Regards,
Dean
Re: MCV lists for highly skewed distributions
On 1 February 2018 at 17:49, Robert Haas wrote:
> One point which I want to emphasize is that the length of the MCV list bounds the estimated frequency of non-MCVs in two ways: no non-MCV is ever thought to be more frequent than the least-common MCV, and however many non-MCVs we think we have (probably fewer than we actually have) have to fit into whatever percentage of the table is not consumed by MCVs. This would be less important if we had reliable n_distinct estimates, but we don't. So, even throwing things into the MCV list that are no more common than the average item can improve planning in some cases.

That's a good point, and a nice explanation. I think that lends more weight to the argument that we should be including as many MCVs as possible, provided there's enough evidence to justify their inclusion.

Regards,
Dean
Re: MCV lists for highly skewed distributions
On 4 February 2018 at 12:18, John Naylor wrote:
> I did the same basic eyeball testing I did on earlier patches, and this is the best implementation so far. I've attached some typical pg_stats contents for HEAD and this patch. More rigorous testing, including of planner estimates on real data, is still needed of course, but I think this is definitely in the right direction.

Thanks for testing. I agree, this new algorithm seems to stand up pretty well in the testing I've done too. One thing about it that can be tuned is the cutoff threshold for the relative standard error -- I chose 10% completely arbitrarily, but it could just as easily be 20%, 30% or even higher (in the patch, s/0.01/0.04/ or s/0.01/0.09/).

Looking at the too-many-MCVs example from [1], the table has 100 million rows, and each distinct value occurs 10 times. With the default stats target of 100, the MCV list is fully populated with values, most having been seen two (or occasionally three) times. Thus, if you query for one of those MCVs, the estimate is 6667 or 10,000 rather than 10, which is a pretty bad over-estimate. Increasing the stats target improves things, because the sample frequencies go down correspondingly.

The problem is also lessened a bit if you randomise the order of the values in the second column so that it isn't correlated to the storage order:

    insert into qqq select a, random()*10^7 from generate_series(1, (10^8)::int) a;

However, in a sample of 30,000 rows it remains reasonably likely that the same value will be seen twice (I typically observed 40 to 50 MCVs in this case, cf. the birthday paradox), and in a sample of 300,000 rows it goes back to giving 1000 MCVs, each seen 2 or 3 times. So this isn't really the fault of the non-uniform ANALYSE sample, but rather of the current algorithm's belief that seeing a value just twice is sufficient for it to qualify as an MCV.
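As a back-of-the-envelope check of that 40-to-50 figure (my own Poisson approximation, not from the thread): with ~10^7 distinct values each appearing ~10 times in 10^8 rows, each value's expected count in a 30,000-row sample is λ = 30000 × 10 / 10^8 = 0.003, and the expected number of values seen at least twice works out to about 45:

```python
import math

# Table: 10^8 rows, ~10^7 distinct values in the second column,
# each appearing ~10 times; ANALYZE sample of 30,000 rows.
N, n_distinct, occurrences, sample = 10**8, 10**7, 10, 30_000

lam = sample * occurrences / N                  # expected sample count per value
p_seen_twice = 1 - math.exp(-lam) * (1 + lam)   # Poisson P(X >= 2)
expected_mcv_candidates = n_distinct * p_seen_twice
print(round(expected_mcv_candidates))
```

This lands at roughly 45, consistent with the 40 to 50 MCVs typically observed.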
The new MCV algorithm solves this by demanding that a value be seen many more times before being included in cases where the table size is much larger than the sample size. The exact threshold depends on the table size, the sample size and the relative error cutoff value. In this example, with a table size of 100 million rows, the sample size makes little difference, because it's always going to be much smaller than the table size, so the number of instances required in the sample depends mostly on the RSE cutoff chosen:

 rse_cutoff | sample=30000 | sample=300000 | sample=3000000
------------+--------------+---------------+----------------
 10%        |          100 |           100 |             97
 20%        |           25 |            25 |             25
 30%        |           12 |            12 |             11
 40%        |            7 |             7 |              7
 50%        |            4 |             4 |              4

So any of those RSE cutoffs solves the too-many-MCVs problem in this particular case, giving no MCVs, although 50% is pushing it a bit, and in any case the theoretical basis for this algorithm breaks down well before then.

The advantage of a larger RSE cutoff is that it gives more MCVs for a non-uniform data distribution, where the current algorithm tends to give too few MCVs. In the attached example, the table has 1 million rows of integer data in a normal distribution with a standard deviation of 50. This typically gives somewhere in the region of 430 to 440 distinct values. Running against HEAD and with this patch, for varying sample sizes and RSE cutoffs, gives the following MCV-list sizes:

 stats_target | mcvs (HEAD) | mcvs (10% RSE) | mcvs (20% RSE) | mcvs (30% RSE)
           10 |          10 |              0 |             10 |             10
           20 |          20 |              0 |             20 |             20
           30 |          30 |              0 |             30 |             30
           40 |          40 |             17 |             40 |             40
           50 |          50 |             50 |             50 |             50
          100 |         100 |            100 |            100 |            100
          300 |         132 |            204 |            261 |            290
         1000 |         205 |            266 |            315 |            338
         3000 |         252 |            364 |            400 |            411
        10000 |         296 |            413 |            414 |            412

One thing to note is that the HEAD algorithm never includes more than around 300 values, because of its requirement for a value to be more common than average. The new algorithm has no such requirement, so it will include nearly every value in the MCV list (except the most uncommon ones) if the stats target is set high enough.
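The numbers in the RSE-cutoff table can be reproduced by solving the SE < cutoff × p condition for the smallest qualifying count (my own sketch, not the patch itself; the SE formula with finite-population correction is the one given earlier in the thread):

```python
import math

def min_mcv_count(N, n, rse_cutoff):
    """Smallest count x for which the sample frequency p = x/n satisfies
    SE < rse_cutoff * p, using the finite-population-corrected SE."""
    for x in range(1, n + 1):
        p = x / n
        se = math.sqrt(p * (1 - p) / n) * math.sqrt((N - n) / (N - 1))
        if se < rse_cutoff * p:
            return x

N = 10**8  # 100-million-row table, as in the example above
for cutoff in (0.1, 0.2, 0.3, 0.4, 0.5):
    print(cutoff, [min_mcv_count(N, n, cutoff)
                   for n in (30_000, 300_000, 3_000_000)])
```

Since N is much larger than n, the threshold is roughly 1/cutoff², which is why the sample size barely matters: 100 occurrences at a 10% cutoff, 25 at 20%, and so on, matching the table.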
Also, the detailed output from the test shows that the resulting estimates based on those MCVs are pretty decent. (Note: this example shows that the too-few-MCVs problem can occur in any non-uniform distribution, not just highly skewed ones.) I've also tried this out against some real-world data, and in the testing I've done so far, the new algorithm is not actually that sensitive to
Re: MCV lists for highly skewed distributions
On 7 February 2018 at 15:25, Robert Haas wrote:
> Do you plan to press forward with this, then, or what's the next step?

Yes, I think the results are pretty good so far, especially for the more non-uniform distributions. AFAICS it solves the 2 original complaints, and I've not yet seen a case where it makes things worse.

I plan to do more testing (and if anyone else wants to, that would be most welcome), and then, barring any unexpected issues/objections, I'll commit it. Probably not this week though.

Regards,
Dean
Re: [HACKERS] [PATCH] Overestimated filter cost and its mitigation
On 26 July 2018 at 07:12, Ashutosh Bapat wrote:
> In the patch clauselist_selectivity() gets called repeatedly for every new qual added to the clause list. Instead, if we try to combine the cost/row estimation with order_qual_clauses() or clauselist_selectivity(), we might be able to do what the current patch does in O(n). clauselist_selectivity() calculates combined selectivity of all the given quals in O(n). We should do something similar to that in this case.

Duplicating the logic of clauselist_selectivity() seems like quite a lot of effort to go to just for improved filter cost estimates. Note also that clauselist_selectivity() might get a good deal more complex with multivariate stats.

Perhaps a reasonable simplification would be to just treat the clauses as independent when estimating the filter cost, and so use the following as a "good enough" estimate for the filter cost:

    cost(A) + sel(A) * cost(B) + sel(A) * sel(B) * cost(C) + ...

That would probably be an improvement over what we currently have. It would be O(n) to compute, and it would probably use the cached selectivity estimates for the clauses.

Note also that with this simplified expression for the filter cost, it would be possible to improve order_qual_clauses() to take into account the clause selectivities when sorting the clauses, and minimise the cost of the filter by evaluating more selective clauses sooner, if they're not too expensive. I believe that amounts to sorting by cost/(1-sel) rather than just cost, except for clauses with sel==1, which it makes sense to move to the end, sorted by cost.

Regards,
Dean
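The proposed cost expression and the cost/(1-sel) ordering are easy to sketch (my own illustration with made-up clause costs and selectivities, not the planner's actual code):

```python
def filter_cost(clauses):
    """Expected per-row cost of evaluating (cost, sel) clauses in order,
    treating them as independent:
    cost(A) + sel(A)*cost(B) + sel(A)*sel(B)*cost(C) + ..."""
    total, frac_remaining = 0.0, 1.0
    for cost, sel in clauses:
        total += frac_remaining * cost
        frac_remaining *= sel
    return total

def order_quals(clauses):
    # Sort by cost/(1-sel); clauses with sel == 1 filter nothing out,
    # so push them to the end, ordered by cost alone.
    return sorted(clauses,
                  key=lambda c: (c[1] == 1.0,
                                 c[0] if c[1] == 1.0 else c[0] / (1.0 - c[1])))

a = (1.0, 0.9)   # cheap but not very selective
b = (2.0, 0.1)   # pricier, but eliminates 90% of rows
print(filter_cost([a, b]))               # cost-only order: 1 + 0.9*2 = 2.8
print(filter_cost(order_quals([a, b])))  # b first: 2 + 0.1*1 = 2.1
```

Sorting purely by cost would run the cheap clause first (total 2.8), whereas the selectivity-aware ordering runs the more selective clause first and gets a lower total (2.1), illustrating why cost/(1-sel) is the right sort key under this model.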
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On 13 July 2018 at 18:27, Tomas Vondra wrote:
> I'm not so sure. The issue is that a lot of the MCV deductions depend on whether we can answer questions like "Is there a single match?" or "If we got a match in MCV, do we need to look at the non-MCV part?" This is not very different from the single-column estimates, except of course here we need to look at multiple columns.
>
> The top-level clauses allow us to make such deductions; with deeper clauses it's much more difficult (perhaps impossible). Because for example with (a=1 AND b=1) there can be just a single match, so if we find it in MCV we're done. With clauses like ((a=1 OR a=2) AND (b=1 OR b=2)) it's not that simple, because there may be multiple combinations and so a match in MCV does not guarantee anything.

Actually, it guarantees a lower bound on the overall selectivity, and maybe that's the best that we can do in the absence of any other stats.

I did wonder if maybe we could do better by tracking allowed value counts. E.g., with a clause like ((a=1 OR a=2) AND (b=1 OR b=2)) it would be fairly simple to see that there are 2 allowed values of a, and 2 allowed values of b, so 4 allowed combinations overall. If we had, say, 3 MCV matches, we'd then know to factor in something extra for the 1 non-MCV match. I'm not sure what to do with non-equality clauses though.

>> I think perhaps it might be better not to attempt to combine the *overall* selectivity returned by mcv_clauselist_selectivity() with that returned by clauselist_selectivity(), but rather merge the logic of these two functions together into a single traversal of the clause-tree. That way, the various individual selectivities can be combined on a clause-by-clause basis to give the best running total based on the available information. Even when the multivariate stats are incomplete, they may still provide a useful lower bound on the selectivity.
>
> I don't follow. The example you presented above showed multivariate stats producing over-estimates, so how would it be helpful to use that as a lower boundary for anything?

No, the multivariate MCV stats were producing good estimates, even for the complex clauses, because they were all common values in my example. The problem was that the good MCV estimate was then being ruined by adding on extra factors, because at the top level it didn't appear to be a full match.

>> If/when all MCV columns have been matched exactly, that lower bound might turn into the appropriate overall result, if there is a matching MCV entry.
>
> Isn't the problem illustrated by the examples that we don't know if the MCV matches represent all matches, or if there may be matches in the histogram?

The example illustrated a case where the MCV matches represented all the matches, but we failed to recognise that. Now we could fix that to reliably detect cases where the MCV matches represent all the matches, but it's still not entirely obvious what to do when they don't.

What I'm considering is an algorithm where we simultaneously compute 3 things:

    simple_sel - The result we would get without multivariate stats (*)
    mcv_sel    - The multivariate MCV result
    hist_sel   - The multivariate histogram result

(*) except that at each stage where we add a new clause to the simple_sel value, we improve upon that estimate by factoring in a lower bound from the multivariate stats so far, so that even if the multivariate stats fail to generate anything at the end, we've managed to account for some of the non-independence of the columns.

If the MCV matches represented all the matches, then at the end we would have simple_sel = mcv_sel and hist_sel = 0, and we'd be done. Otherwise, we'd have simple_sel >= mcv_sel and a possibly non-zero hist_sel, but if we had managed to factor in both mcv_sel and hist_sel to simple_sel at each stage as we went along, then simple_sel is the best overall estimate we can return.

Perhaps this is not so very different from what you're currently doing, except that with this approach we might also end up with mcv_sel = 0 and hist_sel = 0, but still have an improved simple_sel estimate that had accounted for some of the multivariate stats available along the way.

Regards,
Dean
Re: PG 10: could not generate random cancel key
On 17 July 2018 at 14:04, Michael Paquier wrote:
> On Tue, Jul 17, 2018 at 01:33:11PM +0100, Dean Rasheed wrote:
>> Looking for precedents elsewhere, I found [2] which does exactly that, although I'm slightly dubious about the need for the for-loop there. I also found a thread [3], which recommends simply doing
>>
>> if (RAND_status() == 0)
>>     RAND_poll();
>>
>> which seems preferable. Attached is a patch to do this in pg_strong_random().
>
> Checking for the return result of RAND_poll() would also be a good thing to do. From what I read in the OpenSSL code it could fail as well, and we could combine that with a loop attempting to feed the machinery a decided amount of times, just failing after successive failures.

From what I understand from here [1], some parts of OpenSSL call RAND_poll() once on initialisation, and that's enough to get the PRNG going. It's not obvious that calling it multiple times would have any benefit.

They also don't appear to bother checking the return code from RAND_poll() [2]. If it did fail, there'd not be much you could do anyway, so you might as well just let it continue and let RAND_bytes() fail. In fact it may even be possible for RAND_poll() to fail, but still do enough to cause RAND_bytes() to succeed.

Regards,
Dean

[1] https://wiki.openssl.org/index.php/Random_Numbers
[2] https://github.com/benvanik/openssl/blob/master/openssl/crypto/rand/md_rand.c
PG 10: could not generate random cancel key
Last week I upgraded 15 servers from various pre-10 versions to 10.4. At first everything looked OK, but then (around 4 days later) one of them failed with this in the logs:

2018-07-14 01:53:35.840 BST LOG:  could not generate random cancel key
2018-07-14 01:53:37.233 BST LOG:  could not generate random cancel key
2018-07-14 01:53:37.245 BST LOG:  could not generate random cancel key
2018-07-14 01:53:38.553 BST LOG:  could not generate random cancel key
2018-07-14 01:53:38.581 BST LOG:  could not generate random cancel key
2018-07-14 01:54:43.851 BST WARNING:  worker took too long to start; canceled
2018-07-14 01:54:43.862 BST LOG:  could not generate random cancel key
2018-07-14 01:55:09.861 BST LOG:  could not generate random cancel key
2018-07-14 01:55:09.874 BST LOG:  could not generate random cancel key
...

After that it would not accept any new connections until I restarted the postmaster a few hours later. Since then, it has been OK.

It was built using --with-openssl and strong random support enabled, so it was OpenSSL's RAND_bytes() that failed for some reason. I attempted to reproduce it with a small C program directly calling RAND_bytes(), but it refused to fail, even if I disabled haveged and ran my tests in an @reboot cron job. So this failure is evidently quite rare, but the documentation for RAND_bytes() says it *can* fail (returning 0) if it isn't seeded with enough entropy, in which case more must be added, which we're not doing. In addition, once it does fail, repeated calls to RAND_bytes() will continue to fail if it isn't seeded with more data -- hence the inability to start any new backends until after a postmaster restart, which is not a very friendly failure mode.

The OpenSSL documentation suggests that we should use RAND_status() [1] to check that the generator has been seeded with enough data:

    RAND_status() indicates whether or not the CSPRNG has been sufficiently
    seeded. If not, functions such as RAND_bytes(3) will fail.
and if not, RAND_poll() can be used to fix that:

    RAND_poll() uses the system's capabilities to seed the CSPRNG using
    random input obtained from polling various trusted entropy sources.
    The default choice of the entropy source can be modified at build time
    using the --with-rand-seed configure option, see also the NOTES
    section. A summary of the configure options can be displayed with the
    OpenSSL version(1) command.

Looking for precedents elsewhere, I found [2] which does exactly that, although I'm slightly dubious about the need for the for-loop there. I also found a thread [3], which recommends simply doing

    if (RAND_status() == 0)
        RAND_poll();

which seems preferable. Attached is a patch to do this in pg_strong_random().

Thoughts?

Regards,
Dean

[1] https://www.openssl.org/docs/man1.1.1/man3/RAND_status.html
[2] https://github.com/nodejs/node/blob/master/src/node_crypto.cc
[3] https://github.com/openssl/openssl/issues/4148

diff --git a/src/port/pg_strong_random.c b/src/port/pg_strong_random.c
new file mode 100644
index bc7a8aa..fd5ad92
--- a/src/port/pg_strong_random.c
+++ b/src/port/pg_strong_random.c
@@ -103,6 +103,9 @@ pg_strong_random(void *buf, size_t len)
 	 * When built with OpenSSL, use OpenSSL's RAND_bytes function.
 	 */
 #if defined(USE_OPENSSL_RANDOM)
+	/* Make sure that OpenSSL's CSPRNG has been sufficiently seeded */
+	if (RAND_status() == 0)
+		(void) RAND_poll();
 	if (RAND_bytes(buf, len) == 1)
 		return true;
 	return false;
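As an aside, the same check-then-seed pattern can be exercised from Python, whose ssl module wraps the same OpenSSL CSPRNG (there is no RAND_poll binding, so ssl.RAND_add stands in for the reseeding step; this is a sketch of the pattern, not PostgreSQL code):

```python
import os
import ssl

def strong_random(n):
    # Mirror the patch: verify the CSPRNG is sufficiently seeded before
    # drawing bytes, and feed it entropy if not (RAND_add plays the role
    # of RAND_poll here, since Python exposes no RAND_poll binding).
    if not ssl.RAND_status():
        ssl.RAND_add(os.urandom(32), 32.0)
    return ssl.RAND_bytes(n)

key = strong_random(4)   # e.g. a 32-bit cancel key
print(len(key))          # 4
```

On a normally configured system RAND_status() is already true and the reseed branch is never taken, which matches the observation that the failure in pg_strong_random() is rare.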
Re: PG 10: could not generate random cancel key
On 18 July 2018 at 03:17, Michael Paquier wrote:
>> [1] https://wiki.openssl.org/index.php/Random_Numbers
>
> This quote from the wiki is scary so that's not quite clean either for
> Windows:
> "Be careful when deferring to RAND_poll on some Unix systems because it
> does not seed the generator. See the code guarded with
> OPENSSL_SYS_VXWORKS in rand_unix.c. Additionally, RAND_poll can have
> negative interactions on newer Windows platforms, so your program could
> hang or crash depending on the potential issue. See Windows Issues
> below."
>

I think that wiki page is somewhat out of date in places. Both the Windows issues it links to seem to have been fixed a long time ago, so I think using RAND_poll() is probably safe now, although perhaps there are still some Unix platforms on which it won't help either.

>> [2]
>> https://github.com/benvanik/openssl/blob/master/openssl/crypto/rand/md_rand.c
>
> This repository is outdated, on OpenSSL HEAD I am seeing this used only
> in rand_win.c. And this commit is sort of interesting because there was
> a retry loop done with RAND_poll(). Please see this one:
>
> commit: c16de9d8329d41a2433d0f273c080d9d06ad7a87
> author: Dr. Matthias St. Pierre
> date: Thu, 31 Aug 2017 23:16:22 +0200
> committer: Ben Kaduk
> date: Wed, 18 Oct 2017 08:39:20 -0500
> Fix reseeding issues of the public RAND_DRBG
>
> apps/ocsp.c also has the wisdom to check for a failure on RAND_poll().

OK, I guess that it is possible that an older version of OpenSSL requires RAND_poll() to be called multiple times. Here's an updated patch doing that (with up to 8 retries, based on the old OpenSSL code).

Regards,
Dean

diff --git a/src/port/pg_strong_random.c b/src/port/pg_strong_random.c
new file mode 100644
index bc7a8aa..a2ab8c1
--- a/src/port/pg_strong_random.c
+++ b/src/port/pg_strong_random.c
@@ -103,6 +103,37 @@ pg_strong_random(void *buf, size_t len)
 	 * When built with OpenSSL, use OpenSSL's RAND_bytes function.
 	 */
 #if defined(USE_OPENSSL_RANDOM)
+	/*
+	 * Check that OpenSSL's CSPRNG has been sufficiently seeded, and if not
+	 * add more seed data using RAND_poll(). With some older versions of
+	 * OpenSSL, it may be necessary to call RAND_poll() a number of times.
+	 */
+#define NUM_RAND_POLL_RETRIES 8
+
+	if (RAND_status() == 0)
+	{
+		int			i;
+
+		for (i = 0; i < NUM_RAND_POLL_RETRIES; i++)
+		{
+			if (RAND_poll() == 0)
+			{
+				/*
+				 * RAND_poll() failed to generate any seed data, which means
+				 * that RAND_bytes() will probably fail. For now, just fall
+				 * through and let that happen. XXX: maybe we could seed it
+				 * some other way.
+				 */
+				break;
+			}
+
+			if (RAND_status() == 1)
+			{
+				/* The CSPRNG is now sufficiently seeded */
+				break;
+			}
+		}
+	}
 	if (RAND_bytes(buf, len) == 1)
 		return true;
 	return false;
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On 15 July 2018 at 14:29, Tomas Vondra wrote: > It's quite unclear to me how this algorithm could reliably end up with > hist_sel=0 (in cases where we already don't end up with that). I mean, > if a bucket matches the conditions, then the only way to eliminate is by > deducing that MCV already contains all the matches - and that's rather > difficult for complex clauses ... Ah, I didn't realise that you were using histograms for equality clauses as well. I had assumed that they would only use the MCV stats, as in the univariate case. Using histograms for equality seems problematic -- if bucket_contains_value() returns STATS_MATCH_PARTIAL, as things stand that would end up with an estimate of half the bucket's frequency, which seems excessive. Also, if I'm reading it correctly, the code for histograms with not-equals will return STATS_MATCH_PARTIAL for all but one of the buckets, which isn't great either. > I don't know, really. I'll have to try hacking on this a bit I guess. > But there's one obvious question - in what order should we add the > clauses? Does it matter at all, or what is the optimal order? We don't > need to worry about it now, because we simply consider all clauses at > once, but I guess the proposed algorithm is more sensitive to this. I don't know. That's definitely one of the least satisfactory parts of that idea. The alternative seems to be to improve the match tracking in your current algorithm so that it keeps more detailed information about the kinds of matches seen at each level, and combines them appropriately. Maybe that's possible, but I'm struggling to see exactly how. Counting equality clauses seen on each column might be a start. But it would also need to track inequalities, with min/max values or fractions of the non-MCV total, or some such thing. Regards, Dean
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On 16 July 2018 at 21:55, Tomas Vondra wrote:
>
>
> On 07/16/2018 02:54 PM, Dean Rasheed wrote:
>> On 16 July 2018 at 13:23, Tomas Vondra wrote:
>>>>> The top-level clauses allow us to make such deductions, with deeper
>>>>> clauses it's much more difficult (perhaps impossible). Because for
>>>>> example with (a=1 AND b=1) there can be just a single match, so if we
>>>>> find it in MCV we're done. With clauses like ((a=1 OR a=2) AND (b=1 OR
>>>>> b=2)) it's not that simple, because there may be multiple combinations
>>>>> and so a match in MCV does not guarantee anything.
>>>>
>>>> Actually, it guarantees a lower bound on the overall selectivity, and
>>>> maybe that's the best that we can do in the absence of any other
>>>> stats.
>>>>
>>> Hmmm, is that actually true? Let's consider a simple example, with two
>>> columns, each with just 2 values, and a "perfect" MCV list:
>>>
>>>  a | b | frequency
>>> ---+---+-----------
>>>  1 | 1 |       0.5
>>>  2 | 2 |       0.5
>>>
>>> And let's estimate sel(a=1 & b=2).
>>
>> OK. In this case, there are no MCV matches, so there is no lower bound
>> (it's 0).
>>
>> What we could do though is also impose an upper bound, based on the
>> sum of non-matching MCV frequencies. In this case, the upper bound is
>> also 0, so we could actually say the resulting selectivity is 0.
>>
> Hmmm, it's not very clear to me how we would decide which of these cases
> applies, because in most cases we don't have MCV covering 100% of rows.
>
> Anyways, I've been thinking about how to modify the code to work the way
> you proposed (in a way sufficient for a PoC). But after struggling with
> it for a while it occurred to me it might be useful to do it on paper
> first, to verify how it would work on your examples.
>
> So let's use this data
>
> create table foo(a int, b int);
> insert into foo select 1,1 from generate_series(1,5);
> insert into foo select 1,2 from generate_series(1,4);
> insert into foo select 1,x/10 from generate_series(30,25) g(x);
> insert into foo select 2,1 from generate_series(1,3);
> insert into foo select 2,2 from generate_series(1,2);
> insert into foo select 2,x/10 from generate_series(30,50) g(x);
> insert into foo select 3,1 from generate_series(1,1);
> insert into foo select 3,2 from generate_series(1,5000);
> insert into foo select 3,x from generate_series(3,60) g(x);
> insert into foo select x,x/10 from generate_series(4,75) g(x);
>
> Assuming we have perfectly exact statistics, we have these MCV lists
> (both univariate and multivariate):
>
> select a, count(*), round(count(*) /2254937.0, 4) AS frequency
> from foo group by a order by 2 desc;
>
>    a   | count  | frequency
> -------+--------+-----------
>      3 | 614998 |    0.2727
>      2 | 549971 |    0.2439
>      1 | 339971 |    0.1508
>   1014 |      1 |    0.
>  57220 |      1 |    0.
>  ...
>
> select b, count(*), round(count(*) /2254937.0, 4) AS frequency
> from foo group by b order by 2 desc;
>
>  b | count | frequency
> ---+-------+-----------
>  1 | 90010 |    0.0399
>  2 | 65010 |    0.0288
>  3 |    31 |    0.
>  7 |    31 |    0.
>  ...
>
> select a, b, count(*), round(count(*) /2254937.0, 4) AS frequency
> from foo group by a, b order by 3 desc;
>
>  a |     b | count | frequency
> ---+-------+-------+-----------
>  1 |     1 |     5 |    0.0222
>  1 |     2 |     4 |    0.0177
>  2 |     1 |     3 |    0.0133
>  2 |     2 |     2 |    0.0089
>  3 |     1 |     1 |    0.0044
>  3 |     2 |  5000 |    0.0022
>  2 | 12445 |    10 |    0.
>  ...
>
> And let's estimate the two queries with complex clauses, where the
> multivariate stats gave 2x overestimates.
>
> SELECT * FROM foo WHERE a=1 and (b=1 or b=2);
> -- actual 9, univariate: 24253, multivariate: 181091
>
> univariate:
>
>   sel(a=1) = 0.1508
>   sel(b=1) = 0.0399
>   sel(b=2) = 0.0288
>   sel(b=1 or b=2) = 0.0673
>
> multivariate:
>   sel(a=1 and (b=1 or b=2)) = 0.0399 (0.0770)
>
> The second multivariate estimate comes from assuming only the first 5
> items make it to the multivariate MCV list (covering 6.87% of the data)
> and extrapolating the selectivity to the non-MCV data too.
>
> (Notice it's about 2x the actual selectivity, so
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On 16 July 2018 at 13:23, Tomas Vondra wrote:
>>> The top-level clauses allow us to make such deductions, with deeper
>>> clauses it's much more difficult (perhaps impossible). Because for
>>> example with (a=1 AND b=1) there can be just a single match, so if we
>>> find it in MCV we're done. With clauses like ((a=1 OR a=2) AND (b=1 OR
>>> b=2)) it's not that simple, because there may be multiple combinations
>>> and so a match in MCV does not guarantee anything.
>>
>> Actually, it guarantees a lower bound on the overall selectivity, and
>> maybe that's the best that we can do in the absence of any other
>> stats.
>>
> Hmmm, is that actually true? Let's consider a simple example, with two
> columns, each with just 2 values, and a "perfect" MCV list:
>
>  a | b | frequency
> ---+---+-----------
>  1 | 1 |       0.5
>  2 | 2 |       0.5
>
> And let's estimate sel(a=1 & b=2).

OK. In this case, there are no MCV matches, so there is no lower bound (it's 0).

What we could do though is also impose an upper bound, based on the sum of non-matching MCV frequencies. In this case, the upper bound is also 0, so we could actually say the resulting selectivity is 0.

Regards,
Dean
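The lower/upper bound argument above can be made concrete. The sketch below is purely illustrative -- `mcv_item` and the function names are invented here, not the patch's data structures -- but it shows how, for a two-column MCV list, the matching items give a lower bound and the non-matching items an upper bound on the selectivity of an equality pair:

```c
#include <assert.h>

/* Illustrative stand-in for a two-column MCV item (not the patch's type) */
typedef struct
{
	int			a;
	int			b;
	double		freq;
} mcv_item;

/* Lower bound: sum of frequencies of MCV items matching (a, b) */
static double
mcv_lower_bound(const mcv_item *mcv, int n, int a, int b)
{
	double		s = 0.0;

	for (int i = 0; i < n; i++)
		if (mcv[i].a == a && mcv[i].b == b)
			s += mcv[i].freq;
	return s;
}

/* Upper bound: everything except MCV items that cannot possibly match */
static double
mcv_upper_bound(const mcv_item *mcv, int n, int a, int b)
{
	double		nonmatch = 0.0;

	for (int i = 0; i < n; i++)
		if (!(mcv[i].a == a && mcv[i].b == b))
			nonmatch += mcv[i].freq;
	return 1.0 - nonmatch;
}
```

With the "perfect" MCV list from the example, {(1,1,0.5), (2,2,0.5)}, both bounds for sel(a=1 & b=2) come out as 0, pinning the estimate exactly as described.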
Re: PG 10: could not generate random cancel key
On 18 July 2018 at 14:01, Michael Paquier wrote: > Thanks for the updated version. This looks safer to me. It is possible > to simplify the code by removing the external RAND_status() call and > check for RAND_status() first in the loop as per the attached. OK, thanks. Barring any further comments, I'll push and back-patch this to v10 later this week. Regards, Dean
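The simplified control flow -- checking RAND_status() at the top of each loop iteration instead of once up front -- can be exercised in isolation with stubbed functions. This is only a sketch of the loop's logic: `rand_status()`/`rand_poll()` below are stand-ins for the OpenSSL calls, and this is not the committed PostgreSQL code.

```c
#include <assert.h>

#define NUM_RAND_POLL_RETRIES 8

/*
 * Stubs standing in for OpenSSL's RAND_status()/RAND_poll(): pretend the
 * CSPRNG needs three polls before it reports itself sufficiently seeded.
 */
static int	polls_needed = 3;
static int	polls_done = 0;

static int
rand_status(void)
{
	return polls_done >= polls_needed;
}

static int
rand_poll(void)
{
	polls_done++;
	return 1;
}

/* Returns 1 if the generator ends up sufficiently seeded, 0 otherwise */
static int
ensure_seeded(void)
{
	int			i;

	for (i = 0; i < NUM_RAND_POLL_RETRIES; i++)
	{
		if (rand_status() == 1)
			return 1;			/* sufficiently seeded; stop polling */

		if (rand_poll() == 0)
			break;				/* no seed data available; give up and let
								 * RAND_bytes() report the failure */
	}
	return rand_status();
}
```

With the stub requiring three polls, `ensure_seeded()` stops as soon as the status check succeeds, mirroring the early-exit behaviour of the patch; if polling kept failing, it would fall through after the retry limit, as discussed up-thread.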
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On 3 September 2018 at 00:17, Tomas Vondra wrote:
> Hi,
>
> Attached is an updated version of the patch series, adopting a couple of
> improvements - both for MCV lists and histograms.
>
>
> MCV
> ---
>
> For the MCV list part, I've adopted the approach proposed by Dean, using
> base selectivity and using it to correct the non-MCV part. I agree the
> simplicity of the approach is a nice feature, and it seems to produce
> better estimates. I'm not sure I understand the approach perfectly, but
> I've tried to add comments explaining how it works etc.
>

Cool. Looking at this afresh after some time away, it still looks like a reasonable approach, and the test results are encouraging.

In mcv_clauselist_selectivity(), you raise the following question:

    if (matches[i] != STATS_MATCH_NONE)
    {
        /* XXX Shouldn't the basesel be outside the if condition? */
        *basesel += mcv->items[i]->base_frequency;
        s += mcv->items[i]->frequency;
    }

The reason it needs to be inside the "if" block is that what it's computing is the base (independent) selectivity of the clauses found to match the MCV stats, so that then in statext_clauselist_selectivity() it can be used in the following computation:

    /* Estimated selectivity of values not covered by MCV matches */
    other_sel = simple_sel - mcv_basesel;

to give an estimate for the other clauses that aren't covered by the MCV stats. So I think the code is correct as it stands, but if you think it isn't clear enough, maybe a comment update is in order.

The assumption being made is that mcv_basesel will cancel out the part of simple_sel that is due to clauses matching the MCV stats, so that we can then just add on the MCV selectivity.
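Numerically, the correction described above amounts to the following. This is a sketch with invented numbers and an invented function name -- not the patch's code; the clamp at zero is my own assumption about how a sampling-noise case might be handled:

```c
#include <assert.h>
#include <math.h>

/*
 * simple_sel:   overall estimate assuming column independence
 * mcv_basesel:  independence-based estimate for the part of the data
 *               covered by matching MCV items
 * mcv_sel:      actual summed frequency of matching MCV items
 */
static double
corrected_sel(double simple_sel, double mcv_basesel, double mcv_sel)
{
	/* estimated selectivity of values not covered by MCV matches */
	double		other_sel = simple_sel - mcv_basesel;

	/* clamp: mcv_basesel could exceed simple_sel due to sampling noise */
	if (other_sel < 0.0)
		other_sel = 0.0;

	/* add back the (better) MCV-based estimate for the covered part */
	return mcv_sel + other_sel;
}
```

For example, if the independence estimate is 0.2 overall, of which 0.15 is attributable to the matched MCV region, and the MCV items actually sum to 0.05, the corrected estimate is 0.05 + (0.2 - 0.15) = 0.10.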
Of course that's only an approximation, and it won't be exact -- partly due to the fact that simple_sel and mcv_basesel are potentially computed using different sample rows, and so will differ in the MCV region, and partly because of second-order effects arising from the way that selectivities are combined in clauselist_selectivity_simple(). Maybe that's something that can be improved upon, but it doesn't seem like a bad initial approximation.

> I've also changed how we build the MCV lists, particularly how we decide
> how many / which items to store in the MCV list. In the previous version
> I've adopted the same algorithm we use for per-column MCV lists, but in
> certain cases that turned out to be too restrictive.
>
> Consider for example a table with multiple perfectly correlated columns,
> with very few combinations. That is, something like this:
>
> CREATE TABLE t (a int, b int);
>
> INSERT INTO t SELECT mod(i,50), mod(i,50)
> FROM generate_series(1,1e6) s(i);
>
> CREATE STATISTICS s (mcv) ON a,b FROM t;
>
> Now, the data distribution is very simple - uniform, with 50 distinct
> combinations, each representing 2% of data (and the random sample should
> be pretty close to that).
>
> In these cases, analyze_mcv_list decides it does not need any MCV list,
> because the frequency for each value is pretty much 1/ndistinct. For a
> single column that's reasonable, but for multiple correlated columns
> it's rather problematic. We might use the same ndistinct approach
> (assuming we have the ndistinct coefficients), but that still does not
> allow us to decide which combinations are "valid" with respect to the
> data. For example we can't decide that (1,10) does not appear in the data.
>
> So I'm not entirely sure adopting the same analyze_mcv_list algorithm
> for both single-column and multi-column stats makes sense. It may make
> sense to keep more items in the multi-column case for reasons that are
> not really valid in the single-column case.
> > For now I've added a trivial condition to simply keep all the groups > when possible. This probably needs more thought. > Ah, this is a good point. I think I see the problem here. analyze_mcv_list() works by keeping those MCV entries that are statistically significantly more frequent than the selectivity that would have otherwise been assigned to the values, which is based on ndistinct and nullfrac. That's not really right for multivariate stats though, because the selectivity that would be assigned to a multi-column value if it weren't in the multivariate MCV list is actually calculated using the product of individual column selectivities. Fortunately we now calculate this (base_frequency), so actually I think what's needed is a custom version of analyze_mcv_list() that keeps MCV entries if the observed frequency is statistically significantly larger than the base frequency, not the ndistinct-based frequency. It might also be worthwhile doing a little more work to make the base_frequency values more consistent with the way individual column selectivities are actually calculated -- currently the patch always uses the observed single-column frequencies to calculate the base frequencies, but actually the univariate stats would only do that for a subset of the
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On 24 June 2018 at 20:45, Tomas Vondra wrote: > Attached is a rebased version of this patch series, mostly just fixing > the breakage caused by reworked format of initial catalog data. > > Aside from that, the MCV building now adopts the logic introduced by > commit b5db1d93d2 for single-column MCV lists. The new algorithm seems > pretty good and I don't see why multi-column MCV lists should use > something special. Agreed. > I'm sure there are plenty of open questions to discuss, particularly > stuff related to combining the various types of statistics to the final > estimate (a lot of that was already improved based on Dean's reviews). Yes, that's definitely one of the trickier parts of this. I don't think that the current algorithm is ideal as it stands. In particular, the way that it attempts to handle complex combinations of clauses doesn't look right. I think mcv_clauselist_selectivity() and histogram_clauselist_selectivity() are plausibly correct, but the way that the resulting selectivities are combined in statext_clauselist_selectivity() doesn't seem right. In particular, the use of estimate_equality_groups() to count "nmatches" and "fullmatch" only takes into account top-level equality clauses, so it will fail to recognise other cases like (a=1 AND (b=1 OR b=2)) which might be fully covered by, say, the MCV stats. 
Consider, for example, the following simple test case:

create table foo(a int, b int);
insert into foo select 1,1 from generate_series(1,5);
insert into foo select 1,2 from generate_series(1,4);
insert into foo select 1,x/10 from generate_series(30,25) g(x);
insert into foo select 2,1 from generate_series(1,3);
insert into foo select 2,2 from generate_series(1,2);
insert into foo select 2,x/10 from generate_series(30,50) g(x);
insert into foo select 3,1 from generate_series(1,1);
insert into foo select 3,2 from generate_series(1,5000);
insert into foo select 3,x from generate_series(3,60) g(x);
insert into foo select x,x/10 from generate_series(4,75) g(x);

create statistics foo_mcv_ab (mcv) on a,b from foo;
analyse foo;

explain analyse select * from foo where a=1 and b=1;
-- Actual rows: 5, Estimated: 52690 (14149 without MV-stats)

explain analyse select * from foo where a=1 and b=2;
-- Actual rows: 4, Estimated: 41115 (10534 without MV-stats)

explain analyse select * from foo where a=1 and (b=1 or b=2);
-- Actual rows: 9, Estimated: 181091 (24253 without MV-stats)

explain analyse select * from foo where (a=1 or a=2) and (b=1 or b=2);
-- Actual rows: 14, Estimated: 276425 (56716 without MV-stats)

In the first 2 queries the multivariate MCV stats help a lot and give good estimates, but in the last 2 queries the estimates are around twice as large as they should be, even though good MCV stats are available on those specific values.

The tricky thing is to work out how to correctly combine the various stats that are available. In the above examples, the selectivity returned by mcv_clauselist_selectivity() would have been basically correct, but since it will have not been identified as a fullmatch and some non-equality clauses will have been seen at the top-level (the OR clauses), it ends up adding on additional selectivities from clauselist_selectivity().
I think perhaps it might be better not to attempt to combine the *overall* selectivity returned by mcv_clauselist_selectivity() with that returned by clauselist_selectivity(), but rather merge the logic of these two functions together into a single traversal of the clause-tree. That way, the various individual selectivities can be combined on a clause-by-clause basis to give the best running total based on the available information. Even when the multivariate stats are incomplete, they may still provide a useful lower bound on the selectivity. If/when all MCV columns have been matched exactly, that lower bound might turn into the appropriate overall result, if there is a matching MCV entry.

For example, suppose that there are MCV stats on 3 columns a,b,c and a WHERE clause like (a=1 AND b=2 AND c=3). You might process that something like:

* Get sel(a=1) using the normal univariate stats
* Update the multivariate MCV stats match bitmap based on a=1
* Get sel(b=2) using the normal univariate stats
* Compute the total selectivity assuming independence:
    total_sel = sel(a=1)*sel(b=2)
* Update the multivariate MCV stats match bitmap based on b=2
* Compute the multivariate MCV selectivity so far:
    mcv_sel = Sum of MCV frequencies that match so far
* Use that as a lower bound: total_sel = Max(total_sel, mcv_sel)
* Get sel(c=3) using the normal univariate stats
* Compute the new total selectivity assuming independence:
    total_sel *= sel(c=3)
* Update the multivariate MCV stats match bitmap based on c=3
* Compute the new multivariate MCV selectivity:
    mcv_sel = Sum of MCV frequencies that now match
* Use that as a new lower bound: total_sel =
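The clause-by-clause combination outlined above boils down to one repeated step, sketched here with invented names (this is not the patch's code): multiply the running total by the next clause's univariate selectivity, then clamp it from below by the summed frequency of MCV items still matching all clauses processed so far.

```c
#include <assert.h>
#include <math.h>

/*
 * One step of the proposed traversal:
 *   total_sel      - running estimate over the clauses processed so far
 *   clause_sel     - univariate selectivity of the next clause
 *   mcv_match_sel  - summed frequency of MCV items matching all clauses
 *                    processed so far (including the new one)
 */
static double
combine_step(double total_sel, double clause_sel, double mcv_match_sel)
{
	total_sel *= clause_sel;	/* independence assumption */

	if (total_sel < mcv_match_sel)	/* matched MCV items give a lower bound */
		total_sel = mcv_match_sel;

	return total_sel;
}
```

For instance, starting from 1.0, a clause with sel 0.5 and no MCV match yet leaves 0.5; a second clause with sel 0.1 but a matching-MCV sum of 0.2 lifts the naive 0.05 up to the lower bound 0.2.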
Re: Bogus use of canonicalize_qual
On 10 March 2018 at 20:21, Tom Lane wrote:
> I wrote:
>> Whilst fooling about with predtest.c, I noticed a rather embarrassing
>> error. Consider the following, rather silly, CHECK constraint:
>> ...
>> So, what to do? We have a few choices, none ideal:
>
> I'd been assuming that we need to back-patch a fix for this, but after
> further reflection, I'm not so sure. The bug is only triggered by fairly
> silly CHECK constraints, and given that it's been there a long time (at
> least since 9.2 according to my tests) without any field reports, it seems
> likely that nobody is writing such silly CHECK constraints.
>
> If we suppose that we only need to fix it in HEAD, the most attractive
> answer is to add a parameter distinguishing WHERE and CHECK arguments
> to canonicalize_qual. That allows symmetrical simplification of constant-
> NULL subexpressions in the two cases, and the fact that the caller now
> has to make an explicit choice of WHERE vs CHECK semantics might help
> discourage people from applying the function in cases where it's not
> clear which one applies. PFA a patch that does it like that.
>

I agree that this looks like the best choice, but it feels a little unsatisfactory to not back-patch a fix for such a glaring bug. You could perhaps leave the signature of canonicalize_qual() the same, but add a new canonicalize_check() function, and make both thin wrappers on top of a local function accepting the is_check parameter.

Regards,
Dean
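The wrapper arrangement suggested above might look like the following sketch. The signatures are deliberately simplified -- `Expr` here is an opaque stand-in rather than the real PostgreSQL node type, and the worker body is elided -- the point is only the back-patchable API shape:

```c
#include <assert.h>
#include <stddef.h>

typedef struct Expr Expr;		/* opaque stand-in for the real node type */

/* local worker with explicit WHERE-vs-CHECK semantics */
static Expr *
canonicalize_qual_internal(Expr *qual, int is_check)
{
	/*
	 * ... constant-NULL subexpression simplification would differ here
	 * depending on is_check (WHERE treats NULL as false, CHECK as true) ...
	 */
	(void) is_check;
	return qual;
}

/* existing entry point keeps its signature (WHERE semantics) */
Expr *
canonicalize_qual(Expr *qual)
{
	return canonicalize_qual_internal(qual, 0);
}

/* new entry point for CHECK constraints */
Expr *
canonicalize_check(Expr *qual)
{
	return canonicalize_qual_internal(qual, 1);
}
```

Existing callers would compile unchanged against the old entry point, which is what makes this shape attractive for a back-patch.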
Re: MCV lists for highly skewed distributions
On 6 March 2018 at 08:51, John Naylor <jcnay...@gmail.com> wrote: > On 3/5/18, Dean Rasheed <dean.a.rash...@gmail.com> wrote: >> Attached is an updated patch. > Nice. The results look good. Thanks for the review. > I agree it should be in a separate function. As for that large > comment, I spent some time pouring over it to verify the math and make > sure I understand it. I see a couple points where it might be a bit > confusing to someone reading this code for the first time: > > +* This bound is at most 25, and approaches 0 as n approaches 0 or N. > The > +* case where n approaches 0 cannot happen in practice, since the > sample > +* size is at least 300. > > I think the implication is that the bound cannot dip below 10 (the > stated minimum to justify using techniques intended for a normal > distributed sample), so there's no need to code a separate clamp to > ensure 10 values. That might be worth spelling out explicitly in the > comment. > Perhaps I should really say that n can't be less than 300 unless N is too, in which case n=N. So the only case where we need to worry about the bound approaching zero is when n --> N, and we're sampling the entire table, or almost all of it. I'll see if I can re-word that to be clearer. > +* size is at least 300. The case where n approaches N corresponds to > +* sampling the whole the table, in which case it is reasonable to > keep > +* the whole MCV list (have no lower bound), so it makes sense to > apply > +* this formula for all inputs, even though the above derivation is > +* technically only valid when the right hand side is at least around > 10. > > It occurred to me that if the table cardinality is just barely above > the sample size, we could get a case where a value sampled only once > could slip into the MCV list. With the default stat target, this would > be tables with between 30,000 and 31,500 rows. 
(actually I think that's 31250 rows, or more generally when we sample more than around 96% of the table) > I couldn't induce this > behavior, so I looked at the logic that identifies MCV candidates, and > saw a test for > > if (dups_cnt > 1) > > If I'm reading that block correctly, than a value sampled once will > never even be considered for the MCV list, so it seems that the > defacto lower bound for these tables is two. It might be worth > mentioning that in the comment. > > It also means that this clamp on HEAD > > - if (mincount < 2) > - mincount = 2; > > is dead code. > Yes, in compute_scalar_stats(), each value is guaranteed to have been seen at least twice. However, looking at compute_distinct_stats(), that's not the case. I'm not sure if that matters. When we have sampled 96% of the table, any estimate we produce based on the number of times a value has been seen in the sample is likely to be almost exact, even if it has only been seen once or twice. So the estimates from the MCV list will all be good, but I suspect in this case the estimates for all other values that didn't fit in the MCV list will also be good, so it may not matter whether or not we keep those MCV items. My instinct is to keep them, on the grounds that the more information the planner has, the better. >> mcv_cutoff - The relative standard error cutoff percentage used (10, >> 15, 20, ... 50), or -10, -15, ... -50 for the corresponding tests >> against HEAD. > > I'm not quite following the negative numbers for HEAD. They're all the > equivalent, right? Just a label for convenience to make sure you ran > the same number of tests? > Yes, they should all be equivalent. They just reflect the way I ran the tests -- HEAD vs the patch with a cutoff of 10%, HEAD vs the patch with a cutoff of 15%, and so on. So I ended up running it against HEAD 9 times, and I didn't want to throw any of that data away. It's useful for getting a feel for the scope of random variations between test runs. 
Regards, Dean
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On 7 April 2018 at 15:12, Bruce Momjian wrote:
> Uh, where are we on this patch? It isn't going to make it into PG 11?
> Feature development for this has been going on for years.

Unfortunately, I think there's no way that this will be ready for PG11. So far, I have only read parts of the patch, focusing mainly on the planner changes, and how it will make use of the new statistics. I think there are still issues to work out in that area; although I haven't read the latest version yet, I have some doubts about the new approach described.

Looking back at the history of this patch, it appears to have moved through all 4 PG11 commitfests with fairly minimal review, never becoming Ready for Committer. That's a real shame because I agree that it's an important feature, but IMO it's not yet in a committable state. I feel bad saying that, because the patch hasn't really had a fair chance to date, despite Tomas' efforts and quick responses to all review comments. At this stage though, all I can say is that I'll make every effort to keep reviewing it, and I hope Tomas will persist, so that it has a better chance in PG12.

Regards,
Dean
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On 27 March 2018 at 01:36, Tomas Vondra wrote:
> BTW I think there's a bug in handling the fullmatch flag - it should not
> be passed to AND/OR subclauses the way it is, because then
>
>     WHERE a=1 OR (a=2 AND b=2)
>
> will probably set it to 'true' because of (a=2 AND b=2). Which will
> short-circuit the statext_clauselist_selectivity, forcing it to ignore
> the non-MCV part.
>

I'm not sure that's true. Won't the outer call to mcv_update_match_bitmap() overwrite the value of fullmatch returned by the nested call, and set fullmatch to false because it has only seen 1 attribute equality match? I think that's the correct result, but I think that's just luck.

The dubious part is the way fullmatch is calculated for OR clauses -- I think for an OR clause we want to know the attributes matched in *every* subclause, rather than in *any* subclause, as we do for AND. So I think the only way an OR clause at the top-level should return a full match is if every sub-clause was a full match, for example:

    WHERE (a=1 AND b=2) OR (a=2 AND b=1)

But then consider this:

    WHERE a=1 AND (b=1 OR b=2)

That should also potentially be a full match, but that can only work if mcv_update_match_bitmap() returned the set of matching attributes (eqmatches), rather than fullmatch, so that it can be merged appropriately in the caller. So for an OR clause, it needs to return eqmatches containing the list of attributes for which every sub-clause matched with equality against the MCV list, and in an outer AND clause that can be added to the outer eqmatches list, which is the list of attributes for which any sub-clause matched with equality.

Regards,
Dean
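The merging rule described above -- union of equality-matched attributes across AND subclauses, intersection across OR subclauses -- can be sketched with attribute bitmasks. This is purely illustrative (the patch represents eqmatches differently):

```c
#include <assert.h>

/*
 * Attribute bitmasks: bit i set means attribute i was matched with
 * equality against the MCV list.
 */

/* AND: an attribute counts if ANY subclause equality-matched it */
static unsigned
eqmatch_and(unsigned acc, unsigned sub)
{
	return acc | sub;
}

/* OR: an attribute counts only if EVERY subclause equality-matched it */
static unsigned
eqmatch_or(unsigned acc, unsigned sub)
{
	return acc & sub;
}
```

With bit 0 for a and bit 1 for b: (a=1 AND b=2) OR (a=2 AND b=1) intersects 0x3 with 0x3, so both attributes survive and the OR is a full match; a=1 OR (a=2 AND b=2) intersects 0x1 with 0x3, leaving only a; and a=1 AND (b=1 OR b=2) unions 0x1 with the OR's 0x2, giving a full match, exactly as argued above.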
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On 27 March 2018 at 01:36, Tomas Vondra wrote:
> 4) handling of NOT clauses in MCV lists (and in histograms)
>
> The query you posted does not fail anymore...
>

Ah, it turns out the previous query wasn't actually failing for the reason I thought it was -- it was failing because it had a ScalarArrayOpExpr that was being passed to mcv_clauselist_selectivity() because of the wrong list being passed to it. I could see from the code that a NOT clause would have tripped it up, but most NOT clauses actually get rewritten by negate_clause() so they end up not being NOT clauses.

One way to get a NOT clause is with a boolean column, and this reveals another couple of problems:

drop table if exists foo;
create table foo(a int, b boolean);
insert into foo values(1,true);
insert into foo values(1,true);
insert into foo values(1,false);
create statistics foo_mcv_ab (mcv) on a,b from foo;
analyse foo;

select * from foo where a=1 and b;
ERROR: unknown clause type: 99

This fails because the clause is now a Var, which statext_is_compatible_clause() lets through, but mcv_clauselist_selectivity() doesn't support. So it's important to keep those 2 functions in sync, and it might be worth having comments in each to emphasise that.

And, if a NOT clause is used:

select * from foo where a=1 and not b;
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

This is an Assert failure in mcv_update_match_bitmap()'s BoolExpr handling block:

    Assert(bool_clauses != NIL);
    Assert(list_length(bool_clauses) >= 2);

The first of those Asserts is actually redundant with the second, but the second fails because a NOT clause always only has one argument.

Regards,
Dean
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On 27 March 2018 at 14:58, Dean Rasheed <dean.a.rash...@gmail.com> wrote: > On 27 March 2018 at 01:36, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote: >> 4) handling of NOT clauses in MCV lists (and in histograms) >> >> The query you posted does not fail anymore... >> > Ah, it turns out the previous query wasn't actually failing for the > reason I thought it was -- it was failing because it had a > ScalarArrayOpExpr that was being passed to > mcv_clauselist_selectivity() because of the wrong list being passed to > it. I could see from the code that a NOT clause would have tripped it > up, but most NOT clauses actually get rewritten by negate_clause() so > they end up not being NOT clauses. > Thinking about that some, I think that the only NOT clauses this needs to actually worry about are NOTs of boolean Vars. Anything else that this code supports will have been transformed into something other than a NOT before reaching this point. Thus it might be much simpler to handle that as a special case in statext_is_compatible_clause() and mcv_update_match_bitmap(), rather than trying to support general NOT clauses, and going through a recursive call to mcv_update_match_bitmap(), and then having to merge bitmaps. NOT of a boolean Var could then be treated just like var=false, setting the appropriate attribute match entry if it's found in the MCV list. This would allow clauses like (a=1 and NOT b) to be supported, which I don't think currently works, because fullmatch won't get set. Regards, Dean
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On 18 March 2018 at 23:57, Tomas Vondra wrote: > Attached is an updated version of the patch series, addressing issues > pointed out by Alvaro. I'm just starting to look at this now, and I think I'll post individual comments/questions as I get to them rather than trying to review the whole thing, because it's quite a large patch. Apologies if some of this has already been discussed. Looking at the changes to UpdateStatisticsForTypeChange():

+ memset(nulls, 1, Natts_pg_statistic_ext * sizeof(bool));

why the "1" there -- is it just a typo? A wider concern I have is that I think this function is trying to be too clever by only resetting selected stats. IMO it should just reset all stats unconditionally when the column type changes, which would be consistent with what we do for regular stats. Consider, for example, what would happen if a column was changed from real to int -- all the data values will be coerced to integers, losing precision, and any ndistinct and dependency stats would likely be completely wrong afterwards. IMO that's a bug, and should be back-patched independently of these new types of extended stats. Thoughts? Regards, Dean
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On 28 March 2018 at 15:50, Tomas Vondra wrote: > After thinking about this a bit more, I'm not sure if updating the info > based on recursive calls makes sense. The fullmatch flag was supposed to > answer a simple question - can there be just a single matching item? > > If there are equality conditions on all columns, there can be just a > single matching item - if we have found it in the MCV (i.e. s1 > 0.0), > then we don't need to inspect the non-MCV part. > > But handling this in recursive manner breaks this entirely, because with > something like > > (a=1) AND (b=1 OR b=2) > > you suddenly can have multiple matching items. Which makes the fullmatch > flag somewhat useless. > > So I think we should be looking at top-level equality clauses only, just > like number_of_groups() does. > I'm not quite sure what you mean by that, but it sounds a bit limiting in terms of the kinds of user queries that would be supported. > I think we can remove the fullmatch flag from mcv_update_bitmap > entirely. All we need to know is the presence of equality clauses and > whether there was a match in MCV (which we know from s1 > 0.0). > I agree with removing the fullmatch flag, but I don't think we actually need to know about the presence of equality clauses. The way that mcv_update_bitmap() recursively computes the set of matching MCVs seems to be correct. That gives us a value (call it mcv_matchsel) for the proportion of the table's rows that are in the MCV list and satisfy the clauses in stat_clauses. We can also estimate that there are (1-mcv_totalsel)*N rows that are not in the MCV list, for which the MCV stats therefore tell us nothing. The best way to estimate those rows would seem to be to use the logic from the guts of clauselist_selectivity(), without consulting any extended MCV stats (but still consulting other extended stats, I think). Doing that would return a selectivity value (call it nonmcv_sel) for those remaining rows. 
Then a reasonable estimate for the overall selectivity would seem to be mcv_matchsel + (1-mcv_totalsel) * nonmcv_sel and there would be no need for mcv_update_bitmap() to track eqmatches or return fullmatch, and it wouldn't actually matter whether or not we had equality clauses or if all the MCV columns were used. Regards, Dean
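To make the proposed combination concrete, here is a tiny sketch using the names from the discussion above (mcv_matchsel, mcv_totalsel, nonmcv_sel); the clamping to [0,1] is my own defensive addition, not something the email specifies:

```python
def combined_selectivity(mcv_matchsel, mcv_totalsel, nonmcv_sel):
    """Rows matched within the MCV list, plus the estimated matching
    fraction of the (1 - mcv_totalsel) part of the table not covered
    by the MCV list."""
    sel = mcv_matchsel + (1.0 - mcv_totalsel) * nonmcv_sel
    return max(0.0, min(1.0, sel))  # selectivities must stay in [0, 1]

# e.g. 30% of the table is in the MCV list and matches, the MCV list
# covers 80% of the table, and 10% of the remainder is estimated to match:
# combined estimate is 0.3 + 0.2 * 0.1 = 0.32
```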
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On 26 March 2018 at 14:08, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote: > On 03/26/2018 12:31 PM, Dean Rasheed wrote: >> A wider concern I have is that I think this function is trying to be >> too clever by only resetting selected stats. IMO it should just reset >> all stats unconditionally when the column type changes, which would >> be consistent with what we do for regular stats. >> > The argument a year ago was that it's more plausible that the semantics > remains the same. I think the question is how the type change affects > precision - had the type change in the opposite direction (int to real) > there would be no problem, because both ndistinct and dependencies would > produce the same statistics. > > In my experience people are far more likely to change data types in a > way that preserves precision, so I think the current behavior is OK. Hmm, I don't really buy that argument. Altering a column's type allows the data in it to be rewritten in arbitrary ways, and I don't think we should presume that the statistics will still be valid just because the user *probably* won't do something that changes the data much. Regards, Dean
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On 18 March 2018 at 23:57, Tomas Vondra wrote: > Attached is an updated version of the patch series, addressing issues > pointed out by Alvaro. I've just been reading the new code in statext_clauselist_selectivity() and mcv_clauselist_selectivity(), and I'm having a hard time convincing myself that it's correct. This code in statext_clauselist_selectivity() looks a bit odd:

    /*
     * Evaluate the MCV selectivity. See if we got a full match and the
     * minimal selectivity.
     */
    if (stat->kind == STATS_EXT_MCV)
        s1 = mcv_clauselist_selectivity(root, stat, clauses, varRelid,
                                        jointype, sjinfo, rel,
                                        &fullmatch, &mcv_lowsel);

    /*
     * If we got a full equality match on the MCV list, we're done (and the
     * estimate is likely pretty good).
     */
    if (fullmatch && (s1 > 0.0))
        return s1;

    /*
     * If it's a full match (equalities on all columns) but we haven't found
     * it in the MCV, then we limit the selectivity by frequency of the last
     * MCV item. Otherwise reset it to 1.0.
     */
    if (!fullmatch)
        mcv_lowsel = 1.0;

    return Min(s1, mcv_lowsel);

So if fullmatch is true and s1 is greater than 0, it will return s1. If fullmatch is true and s1 equals 0, it will return Min(s1, mcv_lowsel), which will also be s1. If fullmatch is false, mcv_lowsel will be set to 1 and it will return Min(s1, mcv_lowsel), which will also be s1. So it always just returns s1, no? Maybe there's no point in computing fullmatch. Also, wouldn't mcv_lowsel potentially be a significant overestimate anyway? Perhaps 1 minus the sum of the MCV frequencies might be closer, but even that ought to take into account the number of distinct values remaining, although that information may not always be available. Also, just above that, in statext_clauselist_selectivity(), it computes the list stat_clauses, then doesn't appear to use it anywhere. I think that would have been the appropriate thing to pass to mcv_clauselist_selectivity(). 
Otherwise, passing unrelated clauses into mcv_clauselist_selectivity() will cause it to fail to find any matches and then underestimate. I've also come across a few incorrect/out-of-date comments:

/*
 * mcv_clauselist_selectivity
 *      Return the estimated selectivity of the given clauses using MCV list
 *      statistics, or 1.0 if no useful MCV list statistic exists.
 */

-- I can't see any code path that returns 1.0 if there are no MCV stats. The last part of that comment is probably more appropriate to statext_clauselist_selectivity().

/*
 * mcv_update_match_bitmap
 * [snip]
 * The function returns the number of items currently marked as 'match', and
 * ...

-- it doesn't seem to return the number of items marked as 'match'. Then inside that function, this comment is wrong (copied from the preceding comment):

/* AND clauses assume nothing matches, initially */
memset(bool_matches, STATS_MATCH_FULL, sizeof(char) * mcvlist->nitems);

Still reading...

Regards, Dean
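Dean's case analysis of the quoted return logic can be checked mechanically. Here is a toy transcription (inputs are hypothetical, just to exercise all three branches):

```python
def quoted_logic(fullmatch, s1, mcv_lowsel):
    # direct transcription of the C snippet under discussion
    if fullmatch and s1 > 0.0:
        return s1
    if not fullmatch:
        mcv_lowsel = 1.0
    return min(s1, mcv_lowsel)

# For any selectivities in [0, 1], every branch collapses to s1, as argued:
for fullmatch in (True, False):
    for s1 in (0.0, 0.3, 1.0):
        for mcv_lowsel in (0.1, 0.9):
            assert quoted_logic(fullmatch, s1, mcv_lowsel) == s1
```

(The fullmatch=True, s1=0 branch returns min(s1, mcv_lowsel), which is s1 because mcv_lowsel is non-negative; the fullmatch=False branch returns min(s1, 1.0), which is s1 because s1 is at most 1.)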
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On 28 March 2018 at 01:34, Tomas Vondrawrote: > Attached is a patch fixing this. In the end I've decided to keep both > branches - one handling boolean Vars and one for NOT clauses. I think > you're right we can only see (NOT var) cases, but I'm not sure about that. > > For example, what if an operator does not have a negator? Then we can't > transform NOT (a AND b) => (NOT a OR NOT b), I guess. So I kept this for > now, and we can remove this later. > OK, but it's going to have to work harder to set "fullmatch" correctly. If you have a boolean Var clause, which is identical to "bool_var = true", it ought to add to "eqmatches" if true is found in the MCV list. Likewise a boolean Var under a NOT clause is identical to "bool_var = false", so it ought to add to "eqmatches" if false is found in the MCV list. Both those cases would be easy to handle, if general NOT support wasn't required, and you just special-cased "NOT bool_var". If you're going to handle the general case of an arbitrary clause under a NOT, then the recursive call to mcv_update_match_bitmap() would seem to need to know that it's under a NOT (a new "is_not" parameter?), to invert the logic around adding to "eqmatches". That applies to other general OpExpr's too -- for example, "NOT (box_var = ?)" won't be rewritten because there is no box_ne operator, but when mcv_update_match_bitmap() is called recursively with the "box_var = ?", it shouldn't add to "eqmatches", despite this being an EQSEL operator. As mentioned before, I think this whole thing only works if mcv_update_match_bitmap() returns the "eqmatches" list, so that if it is called recursively, it can be merged with the caller's list. What isn't immediately obvious to me is what happens to a NOT clause under another NOT clause, possibly with an AND or OR in-between. Would the "is_not" parameter just flip back to false again? There's also an interesting question around the NullTest clause. 
Since NULLs are being recorded in the MCV list, shouldn't "IS NULL" be treated as semantically like an equality clause, and cause that attribute to be added to "eqmatches" if NULL is found in the MCV list? > I've also realized that the "fullmatch" flag is somewhat confused, > because some places interpreted it as "there is equality on each > attribute" but in fact it also required an actual MCV match. Yes, I was having similar thoughts. I think "eqmatches" / "fullmatch" probably just wants to track whether there was an exact comparison on all the attributes, not whether or not the value was in the MCV list, because the latter is already available in the "matches" bitmap. Knowing that complete, exact comparisons happened, and it wasn't in the MCV list, makes the "(1 - mcv_totalsel) / otherdistinct" estimate reasonable. However, I don't think that tracking "eqmatches" or "fullmatch" is sufficient for the general case. For example, for other operators like "!=", "<", "<=", all (or maybe half) the "1 - mcv_totalsel" ought to count towards the selectivity, plus possibly part of the MCV list (e.g., for "<=", using the sum of the matching MCV frequencies plus half the sum of the non-MCV frequencies might be reasonable -- c.f. scalarineqsel()). For an OR clause, you might want to count the number of non-MCV matches, because logically each one adds another "(1 - mcv_totalsel) / otherdistinct" to the total selectivity. It's not immediately obvious how that can be made to fit into the current code structure. Perhaps it could be made to work by tracking the overall selectivity as it goes along. Or perhaps it could track the count/proportion of non-MCV matches. Regards, Dean
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On 26 March 2018 at 20:17, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote: > On 03/26/2018 09:01 PM, Dean Rasheed wrote: >> Also, just above that, in statext_clauselist_selectivity(), it >> computes the list stat_clauses, then doesn't appear to use it >> anywhere. I think that would have been the appropriate thing to pass >> to mcv_clauselist_selectivity(). Otherwise, passing unrelated clauses >> into mcv_clauselist_selectivity() will cause it to fail to find any >> matches and then underestimate. > > Will check. > Here's a test case demonstrating this bug:

drop table if exists foo;
create table foo(a int, b int, c int);
insert into foo select 0,0,0 from generate_series(1,10);
insert into foo select 1,1,1 from generate_series(1,1);
insert into foo select 2,2,2 from generate_series(1,1000);
insert into foo select 3,3,3 from generate_series(1,100);
insert into foo select x,x,x from generate_series(4,1000) g(x);
insert into foo select x,x,x from generate_series(4,1000) g(x);
insert into foo select x,x,x from generate_series(4,1000) g(x);
insert into foo select x,x,x from generate_series(4,1000) g(x);
insert into foo select x,x,x from generate_series(4,1000) g(x);
analyse foo;
explain analyse select * from foo where a=1 and b=1 and c=1;
create statistics foo_mcv_ab (mcv) on a,b from foo;
analyse foo;
explain analyse select * from foo where a=1 and b=1 and c=1;

With the multivariate MCV statistics, the estimate gets worse because it passes the c=1 clause to mcv_clauselist_selectivity(), and nothing matches. There's also another bug, arising from the fact that statext_is_compatible_clause() says that NOT clauses are supported, but mcv_clauselist_selectivity() doesn't support them. So with the above table:

select * from foo where (a=0 or b=0) and not (b in (1,2));
ERROR: unknown clause type: 111

Regards, Dean
Re: MCV lists for highly skewed distributions
On 17 March 2018 at 17:16, Dean Rasheed <dean.a.rash...@gmail.com> wrote: > Using the calculator above, you can see that the distribution is > fairly normal-like, but with a noticeable positive skew. The 2-stddev > interval is 0.6 to 9.4, and according to the calculator the > probability of the value being less than or equal to 1 is in the > ballpark of the 2.5% figure expected. So even with just 5 occurrences > in the sample, it's fairly close to a normal distribution. > One thing this does illustrate is that the hypergeometric distribution is a discrete distribution and there can be quite large jumps in the probability from one value to the next, so care may be needed when approximating it with a continuous distribution. The standard technique used to handle this is to apply what is known as a continuity correction factor. Suppose that X is a random variable with a discrete hypergeometric distribution, and Y is a continuous normal distribution, with the same mean and variance, being used to approximate X. Then P(X>i) for some integer i is the same as P(X>=i+1), because X can only be integer-valued. The idea is then that you can use P(Y>i+0.5) to get a fairly good approximation to P(X>i). That would correspond to adding 0.5 to the right-hand side of the current test, i.e.,

if (mcv_counts[num_mcv - 1] > selec * samplerows + 2 * stddev + 0.5)
    => Common enough to be included in MCV-list

A lot of the time that won't make much difference, except when dealing with the smaller counts at the tail end of the MCV list, where it might help avoid the too-many-mcvs problem, so I think it's worth trying out. Apparently, in textbooks, an interval like the mean +/- 2*stddev is known as a Wald confidence interval, and the mean +/- (2*stddev+0.5) is the continuity-corrected Wald interval, so it's probably worth mentioning that in the comments. 
They are generally regarded as quite crude approximations for hypergeometric distributions, and there's quite a bit of research around establishing more accurate confidence intervals for this kind of distribution, but the formulae involved tend to be quite complicated, whereas the Wald interval has the advantage of being very simple. In this context, I don't think we need to establish a particularly accurate confidence interval. We just want to be able to say that the value is probably more common than the non-mcv values, without being too rigorous about what "probably" means, as long as it works in practice to discount values that just happen to be a bit more common in the sample, but aren't actually more common in the table as a whole. Regards, Dean
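The continuity-corrected test described above is simple enough to sketch directly; the function name and the example numbers here are mine, chosen to show a borderline count flipping when the correction is applied:

```python
def common_enough(count, selec, samplerows, stddev, continuity=True):
    """Keep a value in the MCV list if its sample count exceeds the
    expected count (selec * samplerows) by more than two standard
    deviations, plus an optional 0.5 continuity correction for the
    discreteness of the hypergeometric distribution."""
    threshold = selec * samplerows + 2.0 * stddev
    if continuity:
        threshold += 0.5
    return count > threshold

# Borderline case: 10 > 7 + 2*1.3 = 9.6 without the correction,
# but 10 is not > 10.1 with it, so the value is (cautiously) dropped.
```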
Re: MCV lists for highly skewed distributions
On 16 March 2018 at 15:26, Tomas Vondra wrote: > Actually, one question - when deciding whether to keep the item in the > MCV list, analyze_mcv_list only compares its frequency with an average > of the rest. But as we're removing items from the MCV list, the average > frequency of the non-MCV items is growing (we're removing items with > higher and higher frequencies). That means the estimates for the least > common items will get higher and higher - essentially overestimates. So, > couldn't/shouldn't analyze_mcv_list consider this too? > Yes, that's the intention. At the start, sumcount is set to the count of all but the last (least common) MCV item, so it can estimate the frequency of the non-MCV items if the last MCV item were to be removed. Then each time through the loop, sumcount is decreased by the removed item's count, increasing the estimated frequency of the non-MCV items accordingly. Regards, Dean
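A simplified sketch of the effect Dean describes, with hypothetical numbers (the real analyze_mcv_list() applies a statistical test, not this bare average, but the sumcount bookkeeping is the same idea):

```python
def non_mcv_avg(counts, samplerows, ndistinct, num_mcv):
    """Estimated average sample count of a non-MCV value if the MCV list
    were cut down to num_mcv items.  counts is sorted descending."""
    sumcount = sum(counts[:num_mcv - 1])  # all but the least common kept item
    return (samplerows - sumcount) / (ndistinct - (num_mcv - 1))

counts = [50, 30, 20, 10, 5]  # per-value sample counts, most common first
avgs = [non_mcv_avg(counts, 200, 50, k) for k in (5, 4, 3, 2)]
# each removal raises the estimated frequency of the remaining non-MCV
# values, exactly as discussed above
assert avgs == sorted(avgs)
```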
Re: MCV lists for highly skewed distributions
On 17 March 2018 at 18:40, Tomas Vondra wrote: > Currently, analyze_mcv_list only checks if the frequency of the current > item is significantly higher than the non-MCV selectivity. My question > is if it shouldn't also consider if removing the item from MCV would not > increase the non-MCV selectivity too much. > Oh, I see what you're saying. In theory, each MCV item we remove is not significantly more common than the non-MCV items at that point, so removing it shouldn't significantly increase the non-MCV selectivity. It's possible the cumulative effect of removing multiple items might start to add up, but I think it would necessarily be a slow effect, and I think it would keep getting slower and slower as more items are removed -- isn't this equivalent to constructing a sequence of numbers where each number is a little greater than the average of all the preceding numbers, which ends up virtually flat-lining? Regards, Dean
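That intuition is easy to check numerically. A quick sketch (eps is an arbitrary small increment; the growth turns out to be roughly logarithmic, i.e. virtually flat after a while):

```python
# Each term is a little greater than the average of all preceding terms.
eps = 0.01
vals = [1.0]
running_sum = 1.0
for _ in range(9999):
    nxt = running_sum / len(vals) + eps
    vals.append(nxt)
    running_sum += nxt

# The second half of the sequence grows far less than the first half,
# and total growth over 10,000 terms is tiny: a near flat-line.
first_half = vals[5000] - vals[0]
second_half = vals[-1] - vals[5000]
assert second_half < first_half / 5
assert vals[-1] - vals[0] < 0.15
```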
Re: MCV lists for highly skewed distributions
On 13 March 2018 at 08:39, John Naylor wrote: >> Also, this is common enough that in fact that distribution >> can be reasonably approximated by a normal distribution. > > For the archives, because it's typically seen 10 times in the sample, > per the rule of thumb mentioned upthread? > Actually, I think that the rule of thumb of at least 10 instances in the sample for a normal approximation is probably too strict. Looking around, values as low as 5 seem to be quite commonly used. Of course there is no right answer here; it depends on what you're doing with it, and how close you want the normal distribution to be to the hypergeometric one. For our purposes, I don't think we really need it to be that close. We just want to be able to justify statements like "the value is *probably* more common than X, and it was unlikely that that was just random sampling error". >> Making use of >> the non-MCV-based estimate allows us to do better -- if the MCV-based >> estimate is statistically significantly higher than the non-MCV-based >> estimate, then it would almost certainly be better to include that >> value in the MCV list, even if its spread is quite large. > > Because we can treat the sample as normal, testing for > 2 standard > deviations means that we're "~95% sure" it's more common in the table > than the non-MCV selectivity would suggest, right? (I realize I'm not > using technically accurate language) > Actually, it's more like 97.5% (remember the normal distribution is symmetric, so there's around 2.5% below the 2-stddev interval, around 95% in it, and another 2.5% above it), but that's just nit-picking. 
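The 95%/97.5%/2.5% figures come straight from the standard normal tail, and can be reproduced with nothing but the standard library (upper_tail is my name for the sketch):

```python
from math import erf, sqrt

def upper_tail(z):
    """P(Z > z) for a standard normal variable Z, via the error function."""
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))

# About 2.3% of the mass lies beyond 2 standard deviations on each side
# (the 2.5%/95% figures quoted above are the usual round-number versions).
```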
There are several nice online calculators for playing around with hypergeometric distributions, such as http://keisan.casio.com/exec/system/1180573201

Consider this distribution:

N = 100,000
n = 1,000
K = 500

Mean = n * K / N = 5
Variance = (K / N) * (1 - K / N) * n * (N - n) / (N - 1) = 4.9
Stddev = sqrt(Variance) = 2.2

Using the calculator above, you can see that the distribution is fairly normal-like, but with a noticeable positive skew. The 2-stddev interval is 0.6 to 9.4, and according to the calculator the probability of the value being less than or equal to 1 is in the ballpark of the 2.5% figure expected. So even with just 5 occurrences in the sample, it's fairly close to a normal distribution. Regards, Dean
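The quoted mean/variance formulas are easy to evaluate directly. The parameter values used here (population N=100000, sample size n=1000, K=500 occurrences in the population) are an illustrative choice consistent with the stated mean of 5 and stddev of 2.2:

```python
from math import sqrt

def hypergeom_mean_stddev(N, n, K):
    """Mean and stddev of the number of 'successes' seen when sampling
    n rows without replacement from N rows, K of which are successes."""
    mean = n * K / N
    var = (K / N) * (1.0 - K / N) * n * (N - n) / (N - 1)
    return mean, sqrt(var)

mean, stddev = hypergeom_mean_stddev(100000, 1000, 500)
# mean comes out as exactly 5.0, stddev as roughly 2.22
```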
Re: MCV lists for highly skewed distributions
On 19 March 2018 at 16:59, John Naylor <jcnay...@gmail.com> wrote: > On 3/19/18, Dean Rasheed <dean.a.rash...@gmail.com> wrote: >> As promised, here is a new patch, with comment updates, per John and >> Tomas' suggestions, plus the continuity correction, which seemed just >> about worthwhile. > > Great. I'm happy with the behavior of the patch. I've marked it ready > for committer. > Thanks for testing. I also did more testing with tables 10-20 times as large and I was happy with the results, so I have pushed this. Regards, Dean
Re: MCV lists for highly skewed distributions
On 18 March 2018 at 12:24, John Naylor wrote: > Over the weekend I tried out a test to measure: > -The size of the MCV list > -The ratio between actual and estimated cardinality of the least > common value in the MCV list. > -The ratio between actual and estimated cardinality of the most common > value not in the MCV list. Nice tests! > Summary for non-uniform distributions: > Note: There's a lot of variation in the estimation of the most common > non-MCV. About 1% of all ANALYZE runs apparently failed to sample one > of the good candidates for the MCV list, resulting in huge > underestimates. It didn't matter what patch was applied. For future > tests, I'll do more runs and measure a truncated mean. Looking at the > number of MCVs is much more stable, but unlike the uniform > distribution case, we don't have an intuitive feel for what that > number should be. That said, > > -The V2 patch is somewhat better than RSE and much better than master > at estimating the most common value not in the MCV list. Yes, that matches my observations too. It stems from the fact that the V2 patch tends to produce more MCVs than the RSE patch (which in turn tends to produce more MCVs than master) for non-uniform distributions. More MCVs then improve the estimates for non-MCVs. > -Bumping the sample stddev threshold to 3 seems to behave similarly to > the RSE patch, maybe slightly better. In a way, that's probably not surprising. The two algorithms are closely related. If the non-MCV frequencies are quite low, the v2 patch with a 3-SD threshold behaves like the RSE patch with a 33% RSE cutoff. The 20% RSE cutoff patch is mathematically equivalent to the v2 patch with a 5-SD threshold and the non-MCV frequencies all set to zero, if that makes sense. 
> Summary for uniform distributions: > Note 1: Using the average is not really adequate for testing > under/overestimation of values in my test setup, since I set the > estimation ratio to zero if there was either an empty MCV list or a > completely full list. In future runs, I'll drill down a bit more > precisely. > That said, there didn't seem to be a huge amount variation between the > patches as far as either of the ratios I was testing. Looking at the > number of MCVs is much better anyway, since we know we want either > none or everything. > > Note 2: All patches fully populated the list when all the values fit > in the MCV list. For the rest of the cases: > > -The RSE patch is not much better than master at excluding MCVs. To make sense of that, you need to look at the errors in the estimates. The RSE patch won't exclude MCVs if it thinks they will give reasonably accurate estimates, so may produce fully populated MCV lists for uniform distributions even when all the distinct values don't fit. That's not ideal, but it isn't necessarily bad. In my previous testing I found that it was good at excluding MCVs in cases where they gave inaccurate estimates. > -The V2 patch is much better than either of these, but still not ideal > (up to 30) > -The V2 patch with a cutoff of 2.5 severely cut down the MCVs (at most 3). > -With a cutoff of 3.0, all are empty. Makes sense, but I don't think we should necessarily put too much emphasis on the uniform tests. Real-world datasets tend to be non-uniform, and the consequences of too few MCVs tend to be worse than too many. I repeated these tests with a 2-SD continuity-corrected threshold and the results fell somewhere between the 2-SD and 2.5-SD cases. My inclination is to go with that, as something that strikes a reasonable balance, and represents a distinct improvement over master in a number of different areas. I'll post something tomorrow, once I've finished wordsmithing the comments. Regards, Dean
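The claimed equivalence between the 20% RSE cutoff and a 5-SD threshold is just algebra: with the non-MCV frequency taken as zero, "count > 5 * stddev" is the same test as "stddev / count < 0.2". A quick mechanical check (the function names and the sweep of values are mine):

```python
def sd_rule_keeps(count, stddev, base_count, k_sd):
    """v2-style rule: the sample count must exceed the expected non-MCV
    count (base_count) by more than k_sd standard deviations."""
    return count > base_count + k_sd * stddev

def rse_rule_keeps(count, stddev, cutoff):
    """RSE-style rule: relative standard error below the cutoff."""
    return stddev / count < cutoff

# With base_count = 0, a 5-SD threshold and a 20% RSE cutoff always agree.
for count in range(1, 200):
    for stddev in (0.5, 1.0, 5.0, 39.9, 40.0, 50.0):
        assert sd_rule_keeps(count, stddev, 0.0, 5) == \
               rse_rule_keeps(count, stddev, 0.2)
```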
Re: MCV lists for highly skewed distributions
> On 03/11/2018 10:23 AM, Dean Rasheed wrote: >> I'm moving this back to a status of "Needs review" partly because the >> code has changed significantly, but also because I want to do more >> testing, particularly with larger datasets. >> John, Tomas, Thanks for looking at this, and sorry for my delayed reply. The demands of my day job keep getting in the way. I'm feeling pretty good about this patch though. There may still be some fine tuning to do, but it's looking like a definite improvement over HEAD, so I'm optimistic that I'll be able to finish it soonish. I'll post back more detailed responses to your specific points shortly. Regards, Dean
Re: MCV lists for highly skewed distributions
On 18 March 2018 at 22:52, Dean Rasheed <dean.a.rash...@gmail.com> wrote: > I'll post something tomorrow, once I've finished wordsmithing the comments. > As promised, here is a new patch, with comment updates, per John and Tomas' suggestions, plus the continuity correction, which seemed just about worthwhile. Regards, Dean

Attachment: mcv-stats-v2cc.patch
Re: MCV lists for highly skewed distributions
On 1 March 2018 at 21:01, Andres Freund wrote: > This sounds like the patch's status of "waiting on author" isn't right, > and it should more be ready for committer? > Yes, I'll take a look at it this weekend. Regards, Dean
Re: MCV lists for highly skewed distributions
On 7 February 2018 at 15:58, Dean Rasheed <dean.a.rash...@gmail.com> wrote: > On 7 February 2018 at 15:25, Robert Haas <robertmh...@gmail.com> wrote: >> Do you plan to press forward with this, then, or what's >> the next step? > > I plan to do more testing TL;DR: I tested this new algorithm using 9 different relative standard error cutoffs (10%, 15%, ... 50%) against a variety of different data distributions, and it looks like a definite improvement over the current algorithm, for a wide range of cutoff values. Overall, it isn't that sensitive to the exact cutoff chosen, but it looks like a value of 20% is a good general-purpose choice. One of the properties of the new algorithm is that the size of the MCV list responds better to changing the stats target, so I don't think it's worth having a separate knob to control the cutoff percentage. Such a knob would have a similar, but weaker effect than the stats target knob, and thus doesn't seem worthwhile. Attached is an updated patch. I decided to move the code that determines the minimum number of occurrences for an MCV list item out to a separate function. It's not much code, but there's a large comment to explain its statistical justification, and I didn't want to duplicate that in 2 different places (possibly to become 3, with the multivariate MCV stats patch). I think this is ready for commit now, so barring objections, I will do so in the next day or so. --- I tested using the attached python script, which tests a large number of different data distributions, comparing the results with the patch vs those from HEAD. It uses EXPLAIN ANALYSE to compare the estimates against actual row counts, both for values in the MCV list, and non-MCV values, making it possible to tell whether it would have been better if the MCV list had been bigger or smaller. There's a lot of random variation between tests, but by running a lot of them, the general pattern can be seen. 
I ran the test 9 times, with different MCV cutoff values in the patched code of 10%, 15%, ... 50%, then collected all the results from the test runs into a single CSV file to analyse using SQL. The columns in the CSV are:

test_name - A textual description of the data distribution used in the test.

mcv_cutoff - The relative standard error cutoff percentage used (10, 15, 20, ... 50), or -10, -15, ... -50 for the corresponding tests against HEAD.

stats_tgt - The stats target used (100 or 1000).

num_mcvs - The size of the resulting MCV list.

avg_mcv_err - The average percentage difference between estimated and actual row counts for values in the MCV list. (There's a bit of a fudge factor in calculating these percentages, to avoid them becoming too large in cases where the row counts are small, but I think it's still useful for comparison purposes.)

max_mcv_err - The maximum percentage difference between estimated and actual row counts for values in the MCV list.

avg_non_mcv_err - The average percentage difference between estimated and actual row counts for a bunch of non-MCV values.

max_non_mcv_err - The maximum percentage difference between estimated and actual row counts for a bunch of non-MCV values.

avg_err - The average percentage difference between estimated and actual row counts for all values tested.

max_err - The maximum percentage difference between estimated and actual row counts for all values tested.

From this, it's possible to look for cases where the MCV list is too large or too small. 
For example, looking for too-many-mcv cases (generally the more uniform data distributions):

SELECT mcv_cutoff, count(*)
  FROM results
 WHERE max_mcv_err - max_non_mcv_err > 50
 GROUP BY 1 ORDER BY 2 DESC;

 mcv_cutoff | count
------------+-------
        -40 |    41
        -50 |    40
        -25 |    39
        -15 |    38
        -20 |    38
        -30 |    38
        -35 |    38
        -45 |    37
        -10 |    37
         50 |    35
         40 |    35
         45 |    34
         35 |    33
         30 |    33
         25 |    25
         20 |    13
         15 |     6
         10 |     3
(18 rows)

SELECT mcv_cutoff, count(*)
  FROM results
 WHERE max_mcv_err - max_non_mcv_err > 100
 GROUP BY 1 ORDER BY 2 DESC;

 mcv_cutoff | count
------------+-------
        -30 |    17
        -40 |    16
        -15 |    15
        -25 |    14
        -10 |    14
        -35 |    14
        -45 |    13
        -50 |    13
        -20 |    12
         35 |    12
         45 |    11
         40 |    10
         30 |    10
         50 |    10
         25 |     6
         10 |     2
(16 rows)

(and implicitly:
         15 |     0
         20 |     0 )

So the patched code is generally better at avoiding the too-many-mcvs problem, but an mcv_cutoff of 30% or more may be too high.

Looking for too-few-mcv cases:

SELECT mcv_cutoff, count(*)
  FROM results
 WHERE (max_non_mcv_err - max_mcv_err > 50 AND num_mcvs < stats_tgt) G
Re: BUG #15307: Low numerical precision of (Co-) Variance
On 27 September 2018 at 06:12, Madeleine Thompson wrote: > This is my first PostgreSQL review. Hopefully I'm getting it right. I > independently ran into the bug in question and found that the author had > already written a patch. I'm happy the author wrote this. > Thanks for the review. And yes, this sort of review is exactly what we need. > (1) I am not experienced with PostgreSQL internals, so I can't comment on the > coding style or usage of internal functions. > > (2) The patch applies cleanly to ce4887bd025b95c7b455fefd817a418844c6aad3. > "make check", "make check-world", and "make install" pass. > > (3) The patch addresses the numeric inaccuracy reported in the bug. (Yay!) I > believe the tests are sufficient to confirm that it works as intended. I > don't think the test coverage *before* this patch was sufficient, and I > appreciate the improved test coverage of the combine functions. I verified > the new tests manually. > Excellent. Thanks for testing. > (4) The comments cite Youngs & Cramer (1971). This is a dated citation. It > justifies its algorithms based on pre-IEEE754 single-precision arithmetic. > Most notably, it assumes that floating-point division is not exact. That > said, the algorithm is still a good choice; it is justified now because > compared to the standard Welford variance, it is less prone to causing > pipeline stalls on a modern CPU. Maybe link to Schubert & Gertz 2018 > (https://dl.acm.org/citation.cfm?id=3223036) instead. The new float8_combine > combine code is (almost) S eqn. 22; the float8_accum code is S eqn. 28. > float8_regr_accum and float8_regr_combine are S eqn. 22. > Hmm, I think that Youngs & Cramer should be cited as the original inventors of the algorithm (which pre-dates the first version of IEEE754 by a few years), although I'm happy to also cite Schubert & Gertz as a more modern justification for the choice of algorithm. 
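For reference, the Youngs-Cramer accumulation being discussed maintains a (N, Sx, Sxx) transition state incrementally. A sketch of the update step, mirroring my understanding of what the patched float8_accum does (not the actual C code):

```python
def yc_accum(state, x):
    """One Youngs-Cramer accumulation step over (N, Sx, Sxx),
    where Sx = sum(X) and Sxx = sum((X - Sx/N)^2)."""
    n, sx, sxx = state
    n += 1
    sx += x
    if n > 1:
        tmp = x * n - sx
        sxx += tmp * tmp / (n * (n - 1.0))
    return n, sx, sxx

state = (0, 0.0, 0.0)
for x in [1.0, 2.0, 3.0, 4.0]:
    state = yc_accum(state, x)
n, sx, sxx = state
# Sxx accumulates sum((x - mean)^2) directly, so the population
# variance is simply Sxx/N: here sum of (1.5^2, 0.5^2, 0.5^2, 1.5^2) = 5.0
```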
Regarding the combine functions, I think perhaps Chan, Golub & LeVeque [1] should be cited as the original inventors. I think they only did variance, not covariance, but that's a straightforward generalisation of their work. [1] Updating Formulae and a Pairwise Algorithm for Computing Sample Variances, T. F. Chan, G. H. Golub & R. J. LeVeque, COMPSTAT 1982. > (5) I think the comment /* Watch out for roundoff error producing a negative > variance */ and associated check are obsolete now. > Ah, good point. We clearly only ever add positive quantities to Sxx and Syy, and likewise when they are combined, so there is no way that they can ever become negative now. Another neat consequence of this algorithm. I'll post an updated patch in a while. Regards, Dean
Re: BUG #15307: Low numerical precision of (Co-) Variance
On Thu, 27 Sep 2018 at 14:22, Dean Rasheed wrote:
> I'll post an updated patch in a while.

I finally got round to looking at this again. Here is an update, based on the review comments.

The next question is whether or not to back-patch this. Although this was reported as a bug, my inclination is to apply this to HEAD only, based on the lack of prior complaints. That also matches our decision for other similar patches, e.g., 7d9a4737c2.

Regards,
Dean

diff --git a/src/backend/utils/adt/float.c b/src/backend/utils/adt/float.c
new file mode 100644
index df35557..d1e12c1
--- a/src/backend/utils/adt/float.c
+++ b/src/backend/utils/adt/float.c
@@ -2401,13 +2401,39 @@ setseed(PG_FUNCTION_ARGS)
  * float8_stddev_samp - produce final result for float STDDEV_SAMP()
  * float8_stddev_pop - produce final result for float STDDEV_POP()
  *
+ * The naive schoolbook implementation of these aggregates works by
+ * accumulating sum(X) and sum(X^2).  However, this approach suffers from
+ * large rounding errors in the final computation of quantities like the
+ * population variance (N*sum(X^2) - sum(X)^2) / N^2, since each of the
+ * intermediate terms is potentially very large, while the difference is often
+ * quite small.
+ *
+ * Instead we use the Youngs-Cramer algorithm [1] which works by accumulating
+ * Sx=sum(X) and Sxx=sum((X-Sx/N)^2), using a numerically stable algorithm to
+ * incrementally update those quantities.  The final computations of each of
+ * the aggregate values is then trivial and gives more accurate results (for
+ * example, the population variance is just Sxx/N).  This algorithm is also
+ * fairly easy to generalize to allow parallel execution without loss of
+ * precision (see, for example, [2]).  For more details, and a comparison of
+ * this with other algorithms, see [3].
+ *
  * The transition datatype for all these aggregates is a 3-element array
- * of float8, holding the values N, sum(X), sum(X*X) in that order.
+ * of float8, holding the values N, Sx, Sxx in that order.
  *
  * Note that we represent N as a float to avoid having to build a special
  * datatype.  Given a reasonable floating-point implementation, there should
  * be no accuracy loss unless N exceeds 2 ^ 52 or so (by which time the
  * user will have doubtless lost interest anyway...)
+ *
+ * [1] Some Results Relevant to Choice of Sum and Sum-of-Product Algorithms,
+ *     E. A. Youngs and E. M. Cramer, Technometrics Vol 13, No 3, August 1971.
+ *
+ * [2] Updating Formulae and a Pairwise Algorithm for Computing Sample
+ *     Variances, T. F. Chan, G. H. Golub & R. J. LeVeque, COMPSTAT 1982.
+ *
+ * [3] Numerically Stable Parallel Computation of (Co-)Variance, Erich
+ *     Schubert and Michael Gertz, Proceedings of the 30th International
+ *     Conference on Scientific and Statistical Database Management, 2018.
  */

 static float8 *
@@ -2441,18 +2467,90 @@ float8_combine(PG_FUNCTION_ARGS)
 	ArrayType  *transarray2 = PG_GETARG_ARRAYTYPE_P(1);
 	float8	   *transvalues1;
 	float8	   *transvalues2;
-
-	if (!AggCheckCallContext(fcinfo, NULL))
-		elog(ERROR, "aggregate function called in non-aggregate context");
+	float8		N1,
+				Sx1,
+				Sxx1,
+				N2,
+				Sx2,
+				Sxx2,
+				tmp,
+				N,
+				Sx,
+				Sxx;

 	transvalues1 = check_float8_array(transarray1, "float8_combine", 3);
 	transvalues2 = check_float8_array(transarray2, "float8_combine", 3);
-	transvalues1[0] = transvalues1[0] + transvalues2[0];
-	transvalues1[1] = float8_pl(transvalues1[1], transvalues2[1]);
-	transvalues1[2] = float8_pl(transvalues1[2], transvalues2[2]);
+	N1 = transvalues1[0];
+	Sx1 = transvalues1[1];
+	Sxx1 = transvalues1[2];

-	PG_RETURN_ARRAYTYPE_P(transarray1);
+	N2 = transvalues2[0];
+	Sx2 = transvalues2[1];
+	Sxx2 = transvalues2[2];
+
+	/*
+	 * The transition values combine using a generalization of the
+	 * Youngs-Cramer algorithm as follows:
+	 *
+	 *	N = N1 + N2
+	 *	Sx = Sx1 + Sx2
+	 *	Sxx = Sxx1 + Sxx2 + N1 * N2 * (Sx1/N1 - Sx2/N2)^2 / N;
+	 *
+	 * It's worth handling the special cases N1 = 0 and N2 = 0 separately
+	 * since those cases are trivial, and we then don't need to worry about
+	 * division-by-zero errors in the general case.
+	 */
+	if (N1 == 0.0)
+	{
+		N = N2;
+		Sx = Sx2;
+		Sxx = Sxx2;
+	}
+	else if (N2 == 0.0)
+	{
+		N = N1;
+		Sx = Sx1;
+		Sxx = Sxx1;
+	}
+	else
+	{
+		N = N1 + N2;
+		Sx = float8_pl(Sx1, Sx2);
+		tmp = Sx1 / N1 - Sx2 / N2;
+		Sxx = Sxx1 + Sxx2 + N1 * N2 * tmp * tmp / N;
+		check_float8_val(Sxx, isinf(Sxx1) || isinf(Sxx2), true);
+	}
+
+	/*
+	 * If we're invoked as an aggregate, we can cheat and modify our first
+	 * parameter in-place to reduce palloc overhead.  Otherwise we construct a
+	 * new array with the updated transition data and return it.
+	 */
+	if (AggCheckCallContext(fcinfo, NULL))
+	{
+		transvalues1[0] = N;
+		transvalues1[1] = Sx;
+		transvalues1[2] = Sxx;
+
+		PG_RETURN_ARRAYTYPE_P(transarray1);
+	}
+	else
+
Re: BUG #15307: Low numerical precision of (Co-) Variance
On Wed, 3 Oct 2018 at 15:58, Madeleine Thompson wrote: > This diff looks good to me. Also, it applies cleanly against > abd9ca377d669a6e0560e854d7e987438d0e612e and passes `make > check-world`. > > I agree that this is not suitable for a patch release. > Pushed to master. Thanks for the review. Regards, Dean
Re: [HACKERS] proposal: schema variables
AFAICS this patch does nothing to consider parallel safety -- that is, as things stand, a variable is allowed in a query that may be parallelised, but its value is not copied to workers, leading to incorrect results. For example:

create table foo(a int);
insert into foo select * from generate_series(1,1000000);
create variable zero int;
let zero = 0;

explain (costs off) select count(*) from foo where a%10 = zero;
                 QUERY PLAN
---------------------------------------------
 Finalize Aggregate
   ->  Gather
         Workers Planned: 2
         ->  Partial Aggregate
               ->  Parallel Seq Scan on foo
                     Filter: ((a % 10) = zero)
(6 rows)

select count(*) from foo where a%10 = zero;
 count
-------
 38037    -- Different random result each time, should be 100,000
(1 row)

Thoughts?

Regards,
Dean
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On Thu, 17 Jan 2019 at 03:42, David Rowley wrote: > 39. I don't see analyze_mcv_list() being used anywhere around this comment: > > * If we can fit all the items onto the MCV list, do that. Otherwise use > * analyze_mcv_list to decide how many items to keep in the MCV list, just > * like for the single-dimensional MCV list. > Right. Also, analyze_mcv_list() is no longer being used anywhere outside of analyze.c, so it can go back to being static. Regards, Dean
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On Thu, 17 Jan 2019 at 03:42, David Rowley wrote:
> 35. The evaluation order of this macro is wrong.
>
> #define ITEM_SIZE(ndims) \
> (ndims * (sizeof(uint16) + sizeof(bool)) + 2 * sizeof(double))
>
> You'd probably want ITEM_SIZE(10) to return 170, but:
>
> select (10 * (2 + 1) + 2 * 8);
>  ?column?
> ----------
>        46
>
> Unsure why this does not cause a crash.

No, the code is actually correct, as explained in the comment above it. Each item contains (ndims) copies of the uint16 index and the boolean, but it always contains exactly 2 doubles, independent of ndims.

> ndims should also have parentheses around it in case someone does
> ITEM_SIZE(x + y), likewise for the other ITEM_* macros.

+1 on that point.

Regards,
Dean
Re: Statement-level Triggers For Uniqueness Checks
On Mon, 24 Dec 2018 at 23:57, Corey Huinker wrote:
>
> So I took a first pass at this, and I got stuck.
>
> [snip]
>
> Any idea where I went wrong?

Take a look at this code in AfterTriggerSaveEvent():

    /*
     * If the trigger is a deferred unique constraint check trigger, only
     * queue it if the unique constraint was potentially violated, which
     * we know from index insertion time.
     */
    if (trigger->tgfoid == F_UNIQUE_KEY_RECHECK)
    {
        if (!list_member_oid(recheckIndexes, trigger->tgconstrindid))
            continue;           /* Uniqueness definitely not violated */
    }

If you trace it back, you'll see that for a statement-level trigger, recheckIndexes will always be empty.

Regards,
Dean
Re: Statement-level Triggers For Uniqueness Checks
On Tue, 25 Dec 2018 at 08:04, Dean Rasheed wrote: > Take a look at this code in AfterTriggerSaveEvent(): > Note that the intention behind that code is that in the (fairly common) case where an insert or update operation is known to not lead to any potential PK/UNIQUE index violations, the overhead for the deferred recheck is minimal. For example, consider a bulk insert where IDs come from a sequence known to not overlap with existing IDs. In that case, each new ID is determined at index insertion time to not possibly conflict with any existing ID, recheckIndexes remains empty for each row, and no recheck triggers are ever queued. One of the tricky things about replacing the current rechecks with statement level triggers will be working out how to deal with such cases (or cases with just a tiny fraction of keys to be rechecked) without introducing a large overhead. Regards, Dean
Re: COPY FROM WHEN condition
On Sat, 24 Nov 2018 at 02:09, Tomas Vondra wrote:
> On 11/23/18 12:14 PM, Surafel Temesgen wrote:
> > On Sun, Nov 11, 2018 at 11:59 PM Tomas Vondra
> > mailto:tomas.von...@2ndquadrant.com>> wrote:
> > So, what about using FILTER here? We already use it for aggregates when
> > filtering rows to process.
> >
> > i think its good idea and describe its purpose more. Attache is a
> > patch that use FILTER instead

FWIW, I vote for just using WHERE here. Right now we have 2 syntaxes for filtering rows in queries, both of which use WHERE immediately before the condition:

1). SELECT ... FROM ... WHERE condition

2). SELECT agg_fn FILTER (WHERE condition) FROM ...

I'm not a huge fan of (2), but that's the SQL standard, so we're stuck with it. There's a certain consistency in its use of WHERE to introduce the condition, and preceding that with FILTER helps to distinguish it from any later WHERE clause. But what you'd be adding here would be a 3rd syntax:

3). COPY ... FROM ... FILTER condition

which IMO will just lead to confusion.

Regards,
Dean
Re: BUG #15446: Crash on ALTER TABLE
On Wed, 9 Jan 2019 at 23:24, Andrew Dunstan wrote: > Here's another attempt. For the rewrite case it kept the logic of the > previous patch to clear all the missing attributes, but if we're not > rewriting we reconstruct the missing value according to the new type > settings. > Looks good to me. Regards, Dean
Policy on cross-posting to multiple lists
Has the policy on cross-posting to multiple lists been hardened recently? The "Crash on ALTER TABLE" thread [1] started on -bugs, but Andrew's message on 8 Jan with an initial proposed patch and my response later that day both CC'ed -hackers and seem to have been rejected, and so are missing from the archives. In that case, it's not a big deal because subsequent replies included the text from the missing messages, so it's still possible to follow the discussion, but I wanted to check whether this was an intentional change of policy. If so, it seems a bit harsh to flat-out reject these messages. My prior understanding was that cross-posting, while generally discouraged, does still sometimes have value. [1] https://www.postgresql.org/message-id/flat/CAEZATCVqksrnXybSaogWOzmVjE3O-NqMSpoHDuDw9_7mhNpeLQ%40mail.gmail.com#2c25e9a783d4685912dcef8b3f3edd63 Regards, Dean
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On Wed, 9 Jan 2019 at 15:40, Tomas Vondra wrote:
> On 1/8/19 3:18 PM, Dean Rasheed wrote:
> > So actually, the estimate for a group of values will be either the MCV
> > item's frequency (if the MCV item is kept), or (roughly) the MCV
> > item's base_frequency (if the MCV item is not kept). That suggests
> > that we should simply keep items that are significantly more or less
> > common than the item's base frequency -- i.e., keep rule (b) and ditch
> > rule (a).
>
> Hmmm, but won't that interfere with how we extrapolate the MCV estimate
> to the non-MCV part? Currently the patch does what you proposed, i.e.
>
>     other_sel = simple_sel - mcv_basesel;
>
> I'm worried that if we only include the items that are significantly
> more or less common than the base frequency, it may skew the other_sel
> estimate.

I don't see how that would skew other_sel. Items close to the base frequency would also tend to be close to simple_sel, making other_sel approximately zero, so excluding them should have little effect.

However... Re-reading the thread where we enhanced the per-column MCV stats last year [1], it was actually the case that an algorithm based on just looking at the relative standard error worked pretty well for a very wide range of data distributions. The final algorithm chosen in analyze_mcv_list() was only a marginal improvement on that, and was directly based upon the fact that, in the univariate statistics case, all the values not included in the MCV list are assigned the same selectivity. However, that's not the case for multivariate stats, because each group not included in the multivariate MCV list gets assigned a different selectivity based on its per-column stats.

So perhaps what we should do for multivariate stats is simply use the relative standard error approach (i.e., reuse the patch in [2] with a 20% RSE cutoff).
That had a lot of testing at the time, against a wide range of data distributions, and proved to be very good, not to mention being very simple. That approach would encompass both groups more and less common than the base frequency, because it relies entirely on the group appearing enough times in the sample to infer that any errors on the resulting estimates will be reasonably well controlled. It wouldn't actually look at the base frequency at all in deciding which items to keep. Moreover, if the group appears sufficiently often in the sample to justify being kept, each of the individual column values must also appear at least that often as well, which means that the errors on the base frequency estimate are also well controlled. That was one of my concerns about other algorithms such as "keep items significantly more or less common than the base frequency" -- in the less common case, there's no lower bound on the number of occurrences seen, and so no guarantee that the errors are kept under control. Regards, Dean [1] https://www.postgresql.org/message-id/flat/CAMkU%3D1yvdGvW9TmiLAhz2erFnvnPFYHbOZuO%2Ba%3D4DVkzpuQ2tw%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAEZATCUEmHCZeOHJN8JO5O9LK_VuFeCecy_AxTk7S_2SmLXeyw%40mail.gmail.com
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On Wed, 26 Dec 2018 at 22:09, Tomas Vondra wrote:
>
> Attached is an updated version of the patch - rebased and fixing the
> warnings reported by Thomas Munro.

Here are a few random review comments based on what I've read so far:

On the CREATE STATISTICS doc page, the syntax in the new examples added to the bottom of the page is incorrect. E.g., instead of

    CREATE STATISTICS s2 WITH (mcv) ON (a, b) FROM t2;

it should read

    CREATE STATISTICS s2 (mcv) ON a, b FROM t2;

I think perhaps there should also be a short explanatory sentence after each example (as in the previous one), just to explain what the example is intended to demonstrate. E.g., for the new MCV example, perhaps say

    These statistics give the planner more detailed information about the
    specific values that commonly appear in the table, as well as an upper
    bound on the selectivities of combinations of values that do not appear
    in the table, allowing it to generate better estimates in both cases.

I don't think there's a need for too much detail there, since it's explained more fully elsewhere, but it feels like it needs a little more just to explain the purpose of the example.

There is additional documentation in perform.sgml that needs updating -- about what kinds of stats the planner keeps. Those docs are actually quite similar to the ones in planstats.sgml. It seems the former focus more on what stats the planner stores, while the latter focus on how the planner uses those stats.

In func.sgml, the docs for pg_mcv_list_items need extending to include the base frequency column. Similarly for the example query in planstats.sgml.

Tab-completion for the CREATE STATISTICS statement should be extended for the new kinds.

Looking at mcv_update_match_bitmap(), it's called 3 times (twice recursively from within itself), and I think the pattern for calling it is a bit messy.
E.g.,

    /* by default none of the MCV items matches the clauses */
    bool_matches = palloc0(sizeof(char) * mcvlist->nitems);

    if (or_clause(clause))
    {
        /* OR clauses assume nothing matches, initially */
        memset(bool_matches, STATS_MATCH_NONE, sizeof(char) * mcvlist->nitems);
    }
    else
    {
        /* AND clauses assume everything matches, initially */
        memset(bool_matches, STATS_MATCH_FULL, sizeof(char) * mcvlist->nitems);
    }

    /* build the match bitmap for the OR-clauses */
    mcv_update_match_bitmap(root, bool_clauses, keys, mcvlist,
                            bool_matches, or_clause(clause));

the comment for the AND case directly contradicts the initial comment, and the final comment is wrong because it could be an AND clause.

For a NOT clause it does:

    /* by default none of the MCV items matches the clauses */
    not_matches = palloc0(sizeof(char) * mcvlist->nitems);

    /* NOT clauses assume nothing matches, initially */
    memset(not_matches, STATS_MATCH_FULL, sizeof(char) * mcvlist->nitems);

    /* build the match bitmap for the NOT-clause */
    mcv_update_match_bitmap(root, not_args, keys, mcvlist,
                            not_matches, false);

so the second comment is wrong.

I understand the evolution that led to this function existing in this form, but I think that it can now be refactored into a "getter" function rather than an "update" function. I.e., something like mcv_get_match_bitmap() which first allocates the array to be returned and initialises it based on the passed-in value of is_or.
That way, all the calling sites can be simplified to one-liners like

    /* get the match bitmap for the AND/OR clause */
    bool_matches = mcv_get_match_bitmap(root, bool_clauses, keys, mcvlist,
                                        or_clause(clause));

In the previous discussion around UpdateStatisticsForTypeChange(), the consensus appeared to be that we should just unconditionally drop all extended statistics when ALTER TABLE changes the type of an included column (just as we do for per-column stats), since such a type change can rewrite the data in arbitrary ways, so there's no reason to assume that the old stats are still valid. I think it makes sense to extract that as a separate patch to be committed ahead of these ones, and I'd also argue for back-patching it.

That's it for now. I'll try to keep reviewing if time permits.

Regards,
Dean
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On Mon, 7 Jan 2019 at 00:45, Tomas Vondra wrote:
>
> FWIW the main unsolved issue (at least on the MCV part) is how it
> decides which items to keep in the list.
>
> As explained in [1], in the multivariate case we can't simply look at
> the group frequency and compare it to the average frequency (of the
> non-MCV items), which is what analyze_mcv_list() does in the
> single-column case. In the multivariate case we also care about observed
> vs. base frequency, i.e. we want the MCV list to include groups that are
> present significantly more/less than the product of per-column stats.
>
> I've repeatedly tried to come up with a criterion that would address
> that, but it seems rather difficult because we can't abandon the other
> criteria either. So the MCV list should include groups that match both
>
> (a) items that are statistically more common than the non-MCV part (i.e.
> the rule from per-column analyze_mcv_list)
>
> (b) items that are statistically more/less common than estimated from
> per-column stats (i.e. the new rule)

Thinking about this some more, I think that it probably isn't appropriate to use analyze_mcv_list() directly, because that's making specific assumptions about how items not in the list will be estimated that aren't actually true for groups of values in multivariate stats.

If a group of values isn't in the MCV list, it gets estimated based on the product of the selectivities from the per-column stats (modulo the additional logic preventing the selectivity from exceeding the total non-MCV selectivity). So actually, the estimate for a group of values will be either the MCV item's frequency (if the MCV item is kept), or (roughly) the MCV item's base_frequency (if the MCV item is not kept). That suggests that we should simply keep items that are significantly more or less common than the item's base frequency -- i.e., keep rule (b) and ditch rule (a).
> Enforcing rule (a) seems reasonable because it ensures the MCV list
> includes all items more frequent than the last one. Without it, it's
> difficult to know whether the absent item is very common (but close to
> base frequency) or very uncommon (so less frequent than the last MCV
> item).

I'm not sure there's much we can do about that. Keeping the item will result in keeping a frequency that we know is close to the base frequency, and not keeping the item will result in per-column stats being used that we expect to also give an estimate close to the base frequency. So either way, the result is about the same, and it's probably better to discard it, leaving more room for other items about which we may have more information.

That said, there is a separate benefit to keeping items in the list even if their frequency is close to the base frequency -- the more items kept, the larger their total selectivity will be, giving a better cap on the non-MCV selectivities. So if, after keeping all items satisfying rule (b), there are free slots available, perhaps they should be used for the most common remaining values satisfying rule (a).

Regards,
Dean
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On Fri, 11 Jan 2019, 21:18 Tomas Vondra wrote:
> On 1/10/19 4:20 PM, Dean Rasheed wrote:
> > ...
> >
> > So perhaps what we should do for multivariate stats is simply use the
> > relative standard error approach (i.e., reuse the patch in [2] with a
> > 20% RSE cutoff). That had a lot of testing at the time, against a wide
> > range of data distributions, and proved to be very good, not to
> > mention being very simple.
> >
> > That approach would encompass both groups more and less common than
> > the base frequency, because it relies entirely on the group appearing
> > enough times in the sample to infer that any errors on the resulting
> > estimates will be reasonably well controlled. It wouldn't actually
> > look at the base frequency at all in deciding which items to keep.
>
> I've been looking at this approach today, and I'm a bit puzzled. That
> patch essentially uses SRE to compute mincount like this:
>
>     mincount = n*(N-n) / (N-n+0.04*n*(N-1))
>
> and then includes all items more common than this threshold.

Right.

> How could that handle items significantly less common than the base
> frequency?

Well, what I meant was that it will *allow* items significantly less common than the base frequency, because it's not even looking at the base frequency. For example, if the table size were N=100,000 and we sampled n=10,000 rows from that, mincount would work out as 22. So it's easy to construct allowed items more common than that and still significantly less common than their base frequency.

A possible refinement would be to say that if there are more than stats_target items more common than this mincount threshold, rather than excluding the least common ones to get the target number of items, exclude the ones closest to their base frequencies, on the grounds that those are the ones for which the MCV stats will make the least difference.
That might complicate the code somewhat though -- I don't have it in front of me, so I can't remember if it even tracks more than stats_target items. Regards, Dean
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On Sun, 13 Jan 2019 at 00:04, Tomas Vondra wrote: > On 1/12/19 8:49 AM, Dean Rasheed wrote: > > A possible refinement would be to say that if there are more than > > stats_target items more common than this mincount threshold, rather than > > excluding the least common ones to get the target number of items, > > exclude the ones closest to their base frequencies, on the grounds that > > those are the ones for which the MCV stats will make the least > > difference. That might complicate the code somewhat though -- I don't > > have it in front of me, so I can't remember if it even tracks more than > > stats_target items. > > Yes, the patch does limit the number of items to stats_target (a maximum > of per-attribute stattarget values, to be precise). IIRC that's a piece > you've added sometime last year ;-) > > I've been experimenting with removing items closest to base frequencies > today, and I came to the conclusion that it's rather tricky for a couple > of reasons. > > 1) How exactly do you measure "closeness" to base frequency? I've tried > computing the error in different ways, including: > > * Max(freq/base, base/freq) > * abs(freq - base) > > but this does not seem to affect the behavior very much, TBH. > > 2) This necessarily reduces mcv_totalsel, i.e. it increases the part not > covered by MCV. And estimates on this part are rather crude. > > 3) It does nothing for "impossible" items, i.e. combinations that do not > exist at all. Clearly, those won't be part of the sample, and so can't > be included in the MCV no matter which error definition we pick. And for > very rare combinations it might lead to sudden changes, depending on > whether the group gets sampled or not. > > So IMHO it's better to stick to the simple SRE approach for now. > OK, that makes sense. Regards, Dean
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
(Removing Adrien from the CC list, because messages to that address keep bouncing)

On Sun, 13 Jan 2019 at 00:31, Tomas Vondra wrote:
>
> On 1/10/19 6:09 PM, Dean Rasheed wrote:
> >
> > In the previous discussion around UpdateStatisticsForTypeChange(), the
> > consensus appeared to be that we should just unconditionally drop all
> > extended statistics when ALTER TABLE changes the type of an included
> > column (just as we do for per-column stats), since such a type change
> > can rewrite the data in arbitrary ways, so there's no reason to assume
> > that the old stats are still valid. I think it makes sense to extract
> > that as a separate patch to be committed ahead of these ones, and I'd
> > also argue for back-patching it.
>
> Wasn't the agreement to keep stats that don't include column values
> (functional dependencies and ndistinct coefficients), and reset only
> more complex stats? That's what happens in master and how it's extended
> by the patch for MCV lists and histograms.

Ah OK, I misremembered the exact conclusion reached last time. In that case the logic in UpdateStatisticsForTypeChange() looks wrong:

    /*
     * If we can leave the statistics as it is, just do minimal cleanup
     * and we're done.
     */
    if (!attribute_referenced && reset_stats)
    {
        ReleaseSysCache(oldtup);
        return;
    }

That should be "|| !reset_stats", or have more parentheses. In fact, I think that computing attribute_referenced is unnecessary, because the dependency information includes the columns that the stats are for, and ATExecAlterColumnType() uses that, so attribute_referenced will always be true.

Regards,
Dean
Re: BUG #15708: RLS 'using' running as wrong user when called from a view
On Mon, 25 Mar 2019 at 20:27, Stephen Frost wrote:
>
> * Dean Rasheed (dean.a.rash...@gmail.com) wrote:
> > It looks like the best place to fix it is in
> > get_policies_for_relation(), since that's where all the policies to be
> > applied for a given RTE are pulled together. Patch attached.
>
> Yes, on a quick review, that looks like a good solution to me as well.

On second thoughts, it actually needs to be in get_row_security_policies(), after making copies of the quals from the policies, otherwise it would be scribbling on the copies from the relcache. Actually that makes the code change a bit simpler too.

Regards,
Dean

diff --git a/src/backend/rewrite/rowsecurity.c b/src/backend/rewrite/rowsecurity.c
new file mode 100644
index 835540c..e9f532c
--- a/src/backend/rewrite/rowsecurity.c
+++ b/src/backend/rewrite/rowsecurity.c
@@ -47,6 +47,7 @@
 #include "nodes/pg_list.h"
 #include "nodes/plannodes.h"
 #include "parser/parsetree.h"
+#include "rewrite/rewriteDefine.h"
 #include "rewrite/rewriteHandler.h"
 #include "rewrite/rewriteManip.h"
 #include "rewrite/rowsecurity.h"
@@ -382,6 +383,13 @@ get_row_security_policies(Query *root, R
 	table_close(rel, NoLock);

 	/*
+	 * Copy checkAsUser to the row security quals and WithCheckOption checks,
+	 * in case they contain any subqueries referring to other relations.
+	 */
+	setRuleCheckAsUser((Node *) *securityQuals, rte->checkAsUser);
+	setRuleCheckAsUser((Node *) *withCheckOptions, rte->checkAsUser);
+
+	/*
 	 * Mark this query as having row security, so plancache can invalidate it
 	 * when necessary (eg: role changes)
 	 */
diff --git a/src/test/regress/expected/rowsecurity.out b/src/test/regress/expected/rowsecurity.out
new file mode 100644
index 4b53401..d764991
--- a/src/test/regress/expected/rowsecurity.out
+++ b/src/test/regress/expected/rowsecurity.out
@@ -3910,6 +3910,33 @@ DROP OWNED BY regress_rls_dob_role1;
 DROP POLICY p1 ON dob_t2; -- should succeed
 DROP USER regress_rls_dob_role1;
 DROP USER regress_rls_dob_role2;
+-- Bug #15708: view + table with RLS should check policies as view owner
+CREATE TABLE ref_tbl (a int);
+INSERT INTO ref_tbl VALUES (1);
+CREATE TABLE rls_tbl (a int);
+INSERT INTO rls_tbl VALUES (10);
+ALTER TABLE rls_tbl ENABLE ROW LEVEL SECURITY;
+CREATE POLICY p1 ON rls_tbl USING (EXISTS (SELECT 1 FROM ref_tbl));
+GRANT SELECT ON ref_tbl TO regress_rls_bob;
+GRANT SELECT ON rls_tbl TO regress_rls_bob;
+CREATE VIEW rls_view AS SELECT * FROM rls_tbl;
+ALTER VIEW rls_view OWNER TO regress_rls_bob;
+GRANT SELECT ON rls_view TO regress_rls_alice;
+SET SESSION AUTHORIZATION regress_rls_alice;
+SELECT * FROM ref_tbl; -- Permission denied
+ERROR:  permission denied for table ref_tbl
+SELECT * FROM rls_tbl; -- Permission denied
+ERROR:  permission denied for table rls_tbl
+SELECT * FROM rls_view; -- OK
+ a
+----
+ 10
+(1 row)
+
+RESET SESSION AUTHORIZATION;
+DROP VIEW rls_view;
+DROP TABLE rls_tbl;
+DROP TABLE ref_tbl;
 --
 -- Clean up objects
 --
diff --git a/src/test/regress/sql/rowsecurity.sql b/src/test/regress/sql/rowsecurity.sql
new file mode 100644
index ea83153..df8fe11
--- a/src/test/regress/sql/rowsecurity.sql
+++ b/src/test/regress/sql/rowsecurity.sql
@@ -1764,6 +1764,32 @@ DROP POLICY p1 ON dob_t2; -- should succ
 DROP USER regress_rls_dob_role1;
 DROP USER regress_rls_dob_role2;

+-- Bug #15708: view + table with RLS should check policies as view owner
+CREATE TABLE ref_tbl (a int);
+INSERT INTO ref_tbl VALUES (1);
+
+CREATE TABLE rls_tbl (a int);
+INSERT INTO rls_tbl VALUES (10);
+ALTER TABLE rls_tbl ENABLE ROW LEVEL SECURITY;
+CREATE POLICY p1 ON rls_tbl USING (EXISTS (SELECT 1 FROM ref_tbl));
+
+GRANT SELECT ON ref_tbl TO regress_rls_bob;
+GRANT SELECT ON rls_tbl TO regress_rls_bob;
+
+CREATE VIEW rls_view AS SELECT * FROM rls_tbl;
+ALTER VIEW rls_view OWNER TO regress_rls_bob;
+GRANT SELECT ON rls_view TO regress_rls_alice;
+
+SET SESSION AUTHORIZATION regress_rls_alice;
+SELECT * FROM ref_tbl; -- Permission denied
+SELECT * FROM rls_tbl; -- Permission denied
+SELECT * FROM rls_view; -- OK
+RESET SESSION AUTHORIZATION;
+
+DROP VIEW rls_view;
+DROP TABLE rls_tbl;
+DROP TABLE ref_tbl;
+
 --
 -- Clean up objects
 --
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On Tue, 26 Mar 2019 at 11:59, Dean Rasheed wrote: > > On Mon, 25 Mar 2019 at 23:36, Tomas Vondra > wrote: > > > > Attached is an updated patch... > > I just looked through the latest set of changes and I have a couple of > additional review comments: > I just spotted another issue while reading the code: it's possible to build an MCV list with more than STATS_MCVLIST_MAX_ITEMS = 8192 items, which then causes an error when the code tries to read it back in:

create temp table foo(a int, b int);
insert into foo select x,x from generate_series(1,10000) g(x);
insert into foo select x,x from generate_series(1,10000) g(x);
alter table foo alter column a set statistics 10000;
alter table foo alter column b set statistics 10000;
create statistics s (mcv) on a,b from foo;
analyse foo;
select * from foo where a=1 and b=1;
ERROR:  invalid length (10000) item array in MCVList

So this needs to be checked when building the MCV list. In fact, the stats targets for table columns can be as large as 10000 (a hard-coded constant in tablecmds.c, which is pretty ugly, but that's a different matter), so I think STATS_MCVLIST_MAX_ITEMS probably ought to match that. There are also a couple of comments that refer to the 8k limit, which would need updating, if you change it. Regards, Dean
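The missing check amounts to clamping the number of MCV items kept while building the list, rather than letting it exceed what the serialisation code can read back. A rough illustrative sketch in Python (function and parameter names are hypothetical, not PostgreSQL's actual C code):

```python
STATS_MCVLIST_MAX_ITEMS = 8192  # hard-coded serialization limit under discussion

def clamp_mcv_items(ngroups, stats_target):
    """Number of MCV items to keep: never more than the distinct groups
    seen in the sample, the statistics target, or the hard limit that
    the deserialization code enforces."""
    return min(ngroups, stats_target, STATS_MCVLIST_MAX_ITEMS)
```

With 10000 distinct groups and a statistics target of 10000, the clamp keeps the list at 8192 items instead of producing one that fails to deserialise.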
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On Mon, 25 Mar 2019 at 23:36, Tomas Vondra wrote: > > Attached is an updated patch, fixing all the issues pointed out so far. > Unless there are some objections, I plan to commit the 0001 part by the > end of this CF. Part 0002 is a matter for PG13, as previously agreed. > Yes, I think that's reasonable. It looks to be in pretty good shape. I have reviewed most of the actual code, but note that I haven't reviewed the docs changes and I didn't spend much time reading code comments. It might benefit from a quick once-over comment tidy up. I just looked through the latest set of changes and I have a couple of additional review comments: In the comment about WIDTH_THRESHOLD, s/pg_statistic/pg_statistic_ext/. In statext_mcv_build(), I'm not convinced by the logic around keeping the whole MCV list if it fits. Suppose there were a small number of very common values, and then a bunch of uniformly distributed less common values. The sample might consist of all the common values, plus one or two instances of some of the uncommon ones, leading to a list that would fit, but it would not be appropriate to keep the uncommon values on the basis of having seen them only one or two times. The fact that the list of items seen fits doesn't by itself mean that they're all common enough to justify being kept. In the per-column stats case, there are a bunch of other checks that have to pass, which are intended to test not just that the list fits, but that it believes that those are all the items in the table. For MV stats, you don't have that, and so I think it would be best to just remove that test (the "if (ngroups > nitems)" test) and *always* call get_mincount_for_mcv_list() to determine how many MCV items to keep. Otherwise there is a risk of keeping too many MCV items, with the ones at the tail end of the list producing poor estimates. Regards, Dean
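The suggested change — always applying the mincount threshold instead of keeping the whole list whenever it fits — can be sketched as follows (illustrative Python with a caller-supplied mincount; in PostgreSQL the threshold comes from get_mincount_for_mcv_list()):

```python
def filter_mcv_items(groups, mincount):
    """Keep only the groups seen often enough in the sample to justify
    treating them as genuinely common; items below the threshold may
    just be sampling noise from the uniform tail of the distribution.
    groups is a list of (values, count) pairs, sorted by count descending."""
    return [(values, count) for values, count in groups if count >= mincount]
```

So a rarely-seen combination that merely happened to fit in the list is dropped, rather than being kept and producing poor estimates at the tail end.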
Re: BUG #15708: RLS 'using' running as wrong user when called from a view
On Thu, 21 Mar 2019 at 00:39, PG Bug reporting form wrote: > > This fails, seemingly because the RLS on 'bar' is being checked by alice, > instead of the view owner bob: > Yes I agree, that appears to be a bug. The subquery in the RLS policy should be checked as the view owner -- i.e., we need to propagate the checkAsUser for the RTE with RLS to any subqueries in its RLS policies. It looks like the best place to fix it is in get_policies_for_relation(), since that's where all the policies to be applied for a given RTE are pulled together. Patch attached. Regards, Dean rls-perm-check-fix.patch Description: Binary data
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On Sun, 24 Mar 2019 at 00:17, David Rowley wrote: > > On Sun, 24 Mar 2019 at 12:41, Tomas Vondra > wrote: > > > > On 3/21/19 4:05 PM, David Rowley wrote: > > > > 29. Looking at the tests I see you're testing that you get bad > > > estimates without extended stats. That does not really seem like > > > something that should be done in tests that are meant for extended > > > statistics. > > > > > > > True, it might be a bit unnecessary. Initially the tests were meant to > > show old/new estimates for development purposes, but it might not be > > appropriate for regression tests. I don't think it's a big issue, it's > > not like it'd slow down the tests significantly. Opinions? > > My thoughts were that if someone did something to improve non-MV > stats, then is it right for these tests to fail? What should the > developer do in the case? update the expected result? remove the test? > It's not so clear. > I think the tests are fine as they are. Don't think of these as "good" and "bad" estimates. They should both be "good" estimates, but under different assumptions -- one assuming no correlation between columns, and one taking into account the relationship between the columns. If someone does do something to "improve" the non-MV stats, then the former tests ought to tell us whether it really was an improvement. If so, then the test result can be updated and perhaps whatever was done ought to be factored into the MV-stats' calculation of base frequencies. If not, the test is providing valuable feedback that perhaps it wasn't such a good improvement after all. Regards, Dean
Re: BUG #15623: Inconsistent use of default for updatable view
On Thu, 28 Feb 2019 at 07:47, Amit Langote wrote: > > +if (attrno == 0) > +elog(ERROR, "Cannot set value in column %d to > DEFAULT", i); > > Maybe: s/Cannot/cannot/g > Ah yes, you're right. That is the convention. > +Assert(list_length(sublist) == numattrs); > > Isn't this Assert useless, because we're setting numattrs to > list_length() and transformInsertStmt ensures that all > sublists are same length? > Well possibly I'm being over-paranoid, but given that it may have come via a previous invocation of rewriteValuesRTE() that may have completely rebuilt the lists, it seemed best to be sure that it hadn't done something unexpected. It's about to use that to read from the attrnos array, so it might read beyond the array bounds if any of the prior logic was faulty. Thanks for looking. Regards, Dean
Re: BUG #15623: Inconsistent use of default for updatable view
On Tue, 12 Feb 2019 at 10:33, Dean Rasheed wrote: > Here's an updated patch ... So I pushed that. However, ... Playing around with it some more, I realised that whilst this appeared to fix the reported problem, it exposes another issue which is down to the interaction between rewriteTargetListIU() and rewriteValuesRTE() --- for an INSERT with a VALUES RTE, rewriteTargetListIU() computes a list of target attribute numbers (attrno_list) from the targetlist in its original order, which rewriteValuesRTE() then relies on being the same length (and in the same order) as each of the lists in the VALUES RTE. That's OK for the initial invocation of those functions, but it breaks down when they're recursively invoked for auto-updatable views. For example, the following test-case, based on a slight variation of the new regression tests: create table foo (a int default 1, b int); create view foo_v as select * from foo; alter view foo_v alter column b set default 2; insert into foo_v values (default), (default); triggers the following Assert in rewriteValuesRTE(): /* Check list lengths (we can assume all the VALUES sublists are alike) */ Assert(list_length(attrnos) == list_length(linitial(rte->values_lists))); What's happening is that the initial targetlist, which contains just 1 entry for the column a, gains another entry to set the default for column b from the view. Then, when it recurses into rewriteTargetListIU()/rewriteValuesRTE() for the base relation, the size of the targetlist (now 2) no longer matches the sizes of the VALUES RTE lists (1). I think that that failed Assert was unreachable prior to 41531e42d3, because the old version of rewriteValuesRTE() would always replace all unresolved DEFAULT items with NULLs, so when it recursed into rewriteValuesRTE() for the base relation, it would always bail out early because there would be no DEFAULT items left, and so it would fail to notice the list size mismatch. 
My first thought was that this could be fixed by having rewriteTargetListIU() compute attrno_list using only those targetlist entries that refer to the VALUES RTE, and thus omit any new entries added due to view defaults. That doesn't work though, because that's not the only way that a list size mismatch can be triggered --- it's also possible that rewriteTargetListIU() will merge together targetlist entries, for example if they're array element assignments referring to the same column, in which case the rewritten targetlist could be shorter than the VALUES RTE lists, and attempting to compute attrno_list from it correctly would be trickier. So actually, I think the right way to fix this is to give up trying to compute attrno_list in rewriteTargetListIU(), and have rewriteValuesRTE() work out the attribute assignments itself for each column of the VALUES RTE by examining the rewritten targetlist. That looks to be quite straightforward, and results in a cleaner separation of logic between the 2 functions, per the attached patch. I think that rewriteValuesRTE() should only ever see DEFAULT items for columns that are simple assignments to columns of the target relation, so it only needs to work out the target attribute numbers for TLEs that contain simple Vars referring to the VALUES RTE. Any DEFAULT seen in a column not matching that would be an error, but actually I think such a thing ought to be a "can't happen" error because of the prior checks during parse analysis, so I've used elog() to report if this does happen. Incidentally, it looks like the part of rewriteValuesRTE()'s comment about subscripted and field assignment has been wrong since a3c7a993d5, so I've attempted to clarify that. That will need to look different pre-9.6, I think. Thoughts? 
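The proposed division of labour — rewriteValuesRTE() working out the attribute assignments itself by scanning the rewritten targetlist for simple Vars referencing the VALUES RTE — can be modelled in miniature (a toy Python model with invented structures, not the real parse trees):

```python
def values_column_attnos(targetlist, values_rti, ncolumns):
    """For each column of the VALUES RTE, find the target-relation
    attribute number it is assigned to, or None if no targetlist entry
    is a simple Var referring to that column (in which case a DEFAULT
    there would be a "can't happen" error, per the discussion above).
    targetlist entries are (resno, var), where var is None for any
    non-Var expression, or a (varno, varattno) pair for a simple Var."""
    attnos = [None] * ncolumns
    for resno, var in targetlist:
        if var is not None and var[0] == values_rti:
            attnos[var[1] - 1] = resno  # varattno is 1-based
    return attnos
```

This derivation works regardless of entries added or merged by rewriteTargetListIU(), since non-Var entries are simply skipped.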
Regards, Dean diff --git a/src/backend/rewrite/rewriteHandler.c b/src/backend/rewrite/rewriteHandler.c new file mode 100644 index 7eb41ff..73fac75 --- a/src/backend/rewrite/rewriteHandler.c +++ b/src/backend/rewrite/rewriteHandler.c @@ -67,14 +67,13 @@ static List *rewriteTargetListIU(List *t CmdType commandType, OverridingKind override, Relation target_relation, - int result_rti, - List **attrno_list); + int result_rti); static TargetEntry *process_matched_tle(TargetEntry *src_tle, TargetEntry *prior_tle, const char *attrName); static Node *get_assignment_input(Node *node); -static bool rewriteValuesRTE(Query *parsetree, RangeTblEntry *rte, - Relation target_relation, List *attrnos, bool force_nulls); +static bool rewriteValuesRTE(Query *parsetree, RangeTblEntry *rte, int rti, + Relation target_relation, bool force_nulls); static void markQueryForLocking(Query *qry, Node *jtnode, LockClauseStrength strength, LockWaitPolicy waitPolicy, bool pushedDown); @@ -705,11 +704,6 @@ adjustJoinTreeList(Query *parsetree, boo * is not needed for rewriting, but will be needed by the planner, and we * can do it essentially for free while handling
Re: Row Level Security − leakproof-ness and performance implications
On Thu, 28 Feb 2019 at 14:13, Robert Haas wrote: > A wild idea might be to let > proleakproof take on three values: yes, no, and maybe. When 'maybe' > functions are involved, we tell them whether or not the current query > involves any security barriers, and if so they self-censor. > Does self-censoring mean that they might still throw an error for some inputs, but that error won't reveal any information about the input values? That's not entirely consistent with my understanding of the definition of leakproof, but maybe there are situations where that amount of information leakage would be OK. So maybe we could have "strictly leakproof" functions that never throw errors and "weakly leakproof" functions (needs a better name) that can throw errors, as long as those errors don't include data values. Then we could allow strict and weak security barriers on a per-table basis, depending on how sensitive the data is in each table (I'm not a fan of using GUCs to control this). Regards, Dean
INSERT ... OVERRIDING USER VALUE vs GENERATED ALWAYS identity columns
So I started looking into the bug noted in [1], but before getting to multi-row inserts, I concluded that the current single-row behaviour isn't spec-compliant. In particular, Syntax Rule 11b of section 14.11 says that an INSERT statement on a GENERATED ALWAYS identity column must specify an overriding clause, but it doesn't place any restriction on the type of overriding clause allowed. In other words it should be possible to use either OVERRIDING SYSTEM VALUE or OVERRIDING USER VALUE, but we currently throw an error unless it's the former. It's useful to allow OVERRIDING USER VALUE for precisely the example use-case given in the INSERT docs: This clause is useful for example when copying values between tables. Writing INSERT INTO tbl2 OVERRIDING USER VALUE SELECT * FROM tbl1 will copy from tbl1 all columns that are not identity columns in tbl2 while values for the identity columns in tbl2 will be generated by the sequences associated with tbl2. which currently only works for a GENERATED BY DEFAULT identity column, but should work equally well for a GENERATED ALWAYS identity column. So I propose the attached patch. Regards, Dean [1] https://postgr.es/m/CAEZATCUmSp3-8nLOpgGcPkpUEXK9TJGM%3DiA6q4E2Sn%3D%2BbwkKNA%40mail.gmail.com diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml new file mode 100644 index 22dbc07..b48a29e --- a/doc/src/sgml/ref/create_table.sgml +++ b/doc/src/sgml/ref/create_table.sgml @@ -803,8 +803,8 @@ WITH ( MODULUS INSERT statement specifies OVERRIDING SYSTEM VALUE. If BY DEFAULT is specified, then the user-specified value takes - precedence. See for details. (In + precedence, unless the INSERT statement specifies + OVERRIDING USER VALUE. + See for details. (In the COPY command, user-specified values are always used regardless of this setting.) 
+ Additionally, if ALWAYS is specified, any attempt to + update the value of the column using an UPDATE + statement specifying any value other than DEFAULT + will be rejected. If BY DEFAULT is specified, the + system will allow values in the column to be updated. + + + The optional sequence_options clause can be used to override the options of the sequence. See for details. diff --git a/doc/src/sgml/ref/insert.sgml b/doc/src/sgml/ref/insert.sgml new file mode 100644 index 62e142f..8108f31 --- a/doc/src/sgml/ref/insert.sgml +++ b/doc/src/sgml/ref/insert.sgml @@ -206,10 +206,16 @@ INSERT INTO If this clause is specified, then any values supplied for identity -columns defined as GENERATED BY DEFAULT are ignored -and the default sequence-generated values are applied. +columns are ignored and the default sequence-generated values are +applied. diff --git a/src/backend/rewrite/rewriteHandler.c b/src/backend/rewrite/rewriteHandler.c new file mode 100644 index 7eb41ff..3dc27eb --- a/src/backend/rewrite/rewriteHandler.c +++ b/src/backend/rewrite/rewriteHandler.c @@ -821,13 +821,15 @@ rewriteTargetListIU(List *targetList, { if (att_tup->attidentity == ATTRIBUTE_IDENTITY_ALWAYS && !apply_default) { -if (override != OVERRIDING_SYSTEM_VALUE) +if (override == OVERRIDING_USER_VALUE) + apply_default = true; +else if (override != OVERRIDING_SYSTEM_VALUE) ereport(ERROR, (errcode(ERRCODE_GENERATED_ALWAYS), errmsg("cannot insert into column \"%s\"", NameStr(att_tup->attname)), errdetail("Column \"%s\" is an identity column defined as GENERATED ALWAYS.", NameStr(att_tup->attname)), - errhint("Use OVERRIDING SYSTEM VALUE to override."))); + errhint("You must specify either OVERRIDING SYSTEM VALUE or OVERRIDING USER VALUE."))); } if (att_tup->attidentity == ATTRIBUTE_IDENTITY_BY_DEFAULT && override == OVERRIDING_USER_VALUE) diff --git a/src/test/regress/expected/identity.out b/src/test/regress/expected/identity.out new file mode 100644 index d7d5178..81044b2 --- 
a/src/test/regress/expected/identity.out +++ b/src/test/regress/expected/identity.out @@ -119,28 +119,32 @@ SELECT * FROM itest3; -- OVERRIDING tests INSERT INTO itest1 VALUES (10, 'xyz'); -INSERT INTO itest1 OVERRIDING USER VALUE VALUES (10, 'xyz'); +INSERT INTO itest1 OVERRIDING SYSTEM VALUE VALUES (20, 'xyz'); +INSERT INTO itest1 OVERRIDING USER VALUE VALUES (30, 'xyz'); SELECT * FROM itest1; a | b +- 1 | 2 | 10 | xyz + 20 | xyz 3 | xyz -(4 rows) +(5 rows) INSERT INTO itest2 VALUES (10, 'xyz'); ERROR: cannot insert into column "a" DETAIL: Column "a" is an identity column defined as GENERATED ALWAYS. -HINT: Use OVERRIDING SYSTEM VALUE to override. -INSERT INTO itest2 OVERRIDING SYSTEM VALUE VALUES (10, 'xyz'); +HINT: You must specify either OVERRIDING SYSTEM VALUE or OVERRIDING USER VALUE. +INSERT INTO itest2
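The decision logic the patch arrives at for identity columns can be summarised in a small illustrative model (Python mirroring the shape of the modified rewriteTargetListIU() branch; this is a sketch, not the C code itself):

```python
def identity_insert_action(identity, override, value_is_default):
    """Return 'default' when the sequence-generated value is applied,
    'user' when the user-supplied value is kept, or raise for a
    disallowed insert. identity is 'always' or 'by default';
    override is None, 'system value' or 'user value'."""
    if value_is_default:
        return 'default'
    if identity == 'always':
        if override == 'user value':
            return 'default'  # new behaviour: ignore the value, use the sequence
        if override != 'system value':
            raise ValueError('cannot insert into GENERATED ALWAYS column')
        return 'user'
    # GENERATED BY DEFAULT: user value wins unless explicitly overridden
    return 'default' if override == 'user value' else 'user'
```

In particular, OVERRIDING USER VALUE now behaves identically for both kinds of identity column, which is what makes the copy-between-tables idiom work.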
Re: INSERT ... OVERRIDING USER VALUE vs GENERATED ALWAYS identity columns
On Mon, 25 Feb 2019 at 12:47, Peter Eisentraut wrote: > > - and the column in new rows will automatically have values from the > > - sequence assigned to it. > > + and new rows in the column will automatically have values from the > > + sequence assigned to them. > > The "it" refers to "the column", so I think it's correct. > Ah, OK. I failed to parse the original wording. > >specifies OVERRIDING SYSTEM VALUE. If > BY > >DEFAULT is specified, then the user-specified value takes > > - precedence. See for details. (In > > + precedence, unless the INSERT statement > specifies > > + OVERRIDING USER VALUE. > > + See for details. (In > > Isn't your change that it now applies to both ALWAYS and BY DEFAULT? So > why attach this phrase to the BY DEFAULT explanation? > The last couple of sentences of that paragraph are describing the circumstances under which the user-specified value will be applied. So for the ALWAYS case, it's only if OVERRIDING SYSTEM VALUE is specified, and for the BY DEFAULT case, it's only if OVERRIDING USER VALUE isn't specified. Without that additional text, the original wording could be taken to mean that for a BY DEFAULT column, the user-specified value always gets applied. > > > > + Additionally, if ALWAYS is specified, any > attempt to > > + update the value of the column using an UPDATE > > + statement specifying any value other than > DEFAULT > > + will be rejected. If BY DEFAULT is > specified, the > > + system will allow values in the column to be updated. > > + > > This is already documented on the INSERT reference page. > I can't see anywhere where we document how UPDATE behaves with identity columns. > > - errhint("Use > > OVERRIDING SYSTEM VALUE to override."))); > > + errhint("You must > > specify either OVERRIDING SYSTEM VALUE or > OVERRIDING USER VALUE."))); > > Is this a good hint? If the user wanted to insert something, then > specifying OVERRIDING USER VALUE won't really accomplish that. 
> OVERRIDING USER VALUE is only useful in the specific situations that the > documentation discussed. Can we detect those? > Hmm, I'm not sure that we can reliably guess what the user intended. What exactly did you have in mind? Regards, Dean
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On Fri, 15 Mar 2019 at 00:06, Tomas Vondra wrote: > I've noticed an annoying thing when modifying type of column not > included in a statistics... > > That is, we don't remove the statistics, but the estimate still changes. > But that's because the ALTER TABLE also resets reltuples/relpages: > > That's a bit unfortunate, and it kinda makes the whole effort to not > drop the statistics unnecessarily kinda pointless :-( > Well not entirely. Repeating that test with 100,000 rows, I get an initial estimate of 9850 (actual 10,000), which then drops to 2451 after altering the column. But if you drop the dependency statistics, the estimate drops to 241, so clearly there is some benefit in keeping them in that case. Besides, I thought there was no extra effort in keeping the extended statistics in this case -- isn't it just using the column dependencies, so in this case UpdateStatisticsForTypeChange() never gets called anyway? Regards, Dean
Re: pgsql: Add support for hyperbolic functions, as well as log10().
On Thu, 14 Mar 2019 at 04:41, Tom Lane wrote: > > Dean Rasheed writes: > > I'm amazed that jacana's asinh() returned -0 for an input of +0. > > Even more amusingly, it returns NaN for acosh('infinity'), cf > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana=2019-03-14%2003%3A00%3A34 > > Presumably that means they calculated "infinity - infinity" at some > point, but why? > Given the -0 result, I don't find that particularly surprising. I suspect lots of formulae would end up doing that without proper special-case handling upfront. It looks like that's the only platform that isn't POSIX compliant though, so maybe it's not worth worrying about. Regards, Dean
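For reference, the POSIX/C99 special cases under discussion are easy to check against a conforming math library (an illustrative Python check; the buildfarm failure is about a platform whose C library gets these wrong):

```python
import math

# POSIX/C99 special values for the inverse hyperbolic functions:
#   asinh(+-0)   = +-0    (sign of zero preserved -- jacana returned -0 for +0)
#   asinh(+-inf) = +-inf
#   acosh(+inf)  = +inf   (jacana instead returned NaN)
inf = float('inf')
acosh_inf = math.acosh(inf)
asinh_inf = math.asinh(inf)
asinh_zero = math.asinh(0.0)
```

Handling these cases explicitly up front, before any formula that could compute "infinity - infinity", avoids depending on the platform library's compliance.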
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On Fri, 15 Mar 2019 at 00:06, Tomas Vondra wrote: > ... attached patch ... Some more review comments, carrying on from where I left off: 16). This regression test fails for me: @@ -654,11 +654,11 @@ -- check change of unrelated column type does not reset the MCV statistics ALTER TABLE mcv_lists ALTER COLUMN d TYPE VARCHAR(64); SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 AND b = ''1'''); estimated | actual ---+ -50 | 50 +11 | 50 (1 row) Maybe that's platform-dependent, given what you said about reltuples/relpages being reset. An easy workaround for this would be to modify this test (and perhaps the one that follows) to just query pg_statistic_ext to see if the MCV statistics have been reset. 17). I'm definitely preferring the new style of tests because they're much neater and easier to read, and to directly see the effect of the extended statistics. One thing I'd consider adding is a query of pg_statistic_ext using pg_mcv_list_items() after creating the MCV stats, both to test that function, and to show that the MCV lists have the expected contents (provided that output isn't too large). 18). Spurious whitespace added to src/backend/statistics/mvdistinct.c. 19). In the function comment for statext_mcv_clauselist_selectivity(), the name at the top doesn't match the new function name. Also, I think it should mention MCV in the initial description. I.e., instead of +/* + * mcv_clauselist_selectivity + *Estimate clauses using the best multi-column statistics. it should say: +/* + * statext_mcv_clauselist_selectivity + *Estimate clauses using the best multi-column MCV statistics. 20). Later in the same comment, this part should now be deleted: + * + * So (simple_selectivity - base_selectivity) may be seen as a correction for + * the part not covered by the MCV list. 21). For consistency with other bms_ functions, I think the name of the Bitmapset argument for bms_member_index() should just be called "a". 
Nitpicking, I'd also put bms_member_index() immediately after bms_is_member() in the source, to match the header. 22). mcv_get_match_bitmap() should really use an array of bool rather than an array of char. Note that a bool is guaranteed to be of size 1, so it won't make things any less efficient, but it will allow some code to be made neater. E.g., all clauses like "matches[i] == false" and "matches[i] != false" can just be made "!matches[i]" or "matches[i]". Also the Min/Max expressions on those match flags can be replaced with the logical operators && and ||. 23). Looking at this code in statext_mcv_build(): /* store info about data type OIDs */ i = 0; j = -1; while ((j = bms_next_member(attrs, j)) >= 0) { VacAttrStats *colstat = stats[i]; mcvlist->types[i] = colstat->attrtypid; i++; } it isn't actually making use of the attribute numbers (j) from attrs, so this could be simplified to: /* store info about data type OIDs */ for (i = 0; i < numattrs; i++) mcvlist->types[i] = stats[i]->attrtypid; 24). Later in that function, the following comment doesn't appear to make sense. Is this possibly from an earlier version of the code? /* copy values from the _previous_ group (last item of) */ 25). As for (23), in build_mss(), the loop over the Bitmapset of attributes never actually uses the attribute numbers (j), so that could just be a loop from i=0 to numattrs-1, and then that function doesn't need to be passed the Bitmapset at all -- it could just be passed the integer numattrs. 26). build_distinct_groups() looks like it makes an implicit assumption that the counts of the items passed in are all zero. That is indeed the case, if they've come from build_sorted_items(), because that does a palloc0(), but that feels a little fragile. I think it would be better if build_distinct_groups() explicitly set the count each time it detects a new group. 27). 
In statext_mcv_serialize(), the TODO comment says * TODO: Consider packing boolean flags (NULL) for each item into a single char * (or a longer type) instead of using an array of bool items. A more efficient way to save space might be to do away with the boolean null flags entirely, and just use a special index value like 0x to signify a NULL value. 28). I just spotted the 1MB limit on the serialised MCV list size. I think this is going to be too limiting. For example, if the stats target is at its maximum of 10000, that only leaves around 100 bytes for each item's values, which is easily exceeded. In fact, I think this approach for limiting the MCV list size isn't a good one -- consider what would happen if there were lots of very large values. Would it run out of memory before getting to that test? Even if not, it would likely take an excessive amount of time. I think this part of the patch needs a bit of a rethink. My first thought is to do something similar to what happens for per-column MCVs, and set an upper limit on the size of each value that is ever considered for inclusion in the stats (c.f. WIDTH_THRESHOLD and toowide_cnt in analyse.c). Over-wide values should be excluded early on, and it will need to track whether or not any such values were excluded, because then it wouldn't be appropriate to treat the stats as complete and keep the entire list, without calling get_mincount_for_mcv_list(). Regards, Dean
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On Sat, 16 Mar 2019 at 23:44, Tomas Vondra wrote: > > > 21). For consistency with other bms_ functions, I think the name of > > the Bitmapset argument for bms_member_index() should just be called > > "a". Nitpicking, I'd also put bms_member_index() immediately after > > bms_is_member() in the source, to match the header. > > I think I've already done the renames in the last patch I submitted (are > you looking at an older version of the code, perhaps?). I've moved it > right after bms_is_member - good idea. > Ah OK, I was on the 20190315 patch yesterday. I've just updated to the 20190317 patch. It looks like you forgot to update the argument name in the header file though. Regards, Dean
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On Sat, 16 Mar 2019 at 23:44, Tomas Vondra wrote: > > > 28). I just spotted the 1MB limit on the serialised MCV list size. I > > think this is going to be too limiting. For example, if the stats > > target is at its maximum of 10000, that only leaves around 100 bytes > > for each item's values, which is easily exceeded. In fact, I think > > this approach for limiting the MCV list size isn't a good one -- > > consider what would happen if there were lots of very large values. > > Would it run out of memory before getting to that test? Even if not, > > it would likely take an excessive amount of time. > > > > True. I don't have a very good argument for a specific value, or even > having an explicit limit at all. I've initially added it mostly as a > safety for development purposes, but I think you're right we can just > get rid of it. I don't think it'd run out of memory before hitting the > limit, but I haven't tried very hard (but I recall running into the 1MB > limit in the past). > I've just been playing around a little with this and found that it isn't safely dealing with toasted values. For example, consider the following test:

create or replace function random_string(x int) returns text as $$
  select substr(string_agg(md5(random()::text), ''), 1, x)
    from generate_series(1,(x+31)/32);
$$ language sql;

drop table if exists t;
create table t(a int, b text);
insert into t values (1, random_string(10000000));
create statistics s (mcv) on a,b from t;
analyse t;
select length(b), left(b,5), right(b,5) from t;
select length(stxmcv), length((m.values::text[])[2]),
       left((m.values::text[])[2], 5), right((m.values::text[])[2],5)
  from pg_statistic_ext, pg_mcv_list_items(stxmcv) m
 where stxrelid = 't'::regclass;

The final query returns the following:

 length |  length  | left  | right
--------+----------+-------+-------
    250 | 10000000 | c2667 | 71492
(1 row)

suggesting that there's something odd about the stxmcv value. Note, also, that it doesn't hit the 1MB limit, even though the value is much bigger than that.
If I then delete the value from the table, without changing the stats, and repeat the final query, it falls over: delete from t where a=1; select length(stxmcv), length((m.values::text[])[2]), left((m.values::text[])[2], 5), right((m.values::text[])[2],5) from pg_statistic_ext, pg_mcv_list_items(stxmcv) m where stxrelid = 't'::regclass; ERROR: unexpected chunk number 5008 (expected 0) for toast value 16486 in pg_toast_16480 So I suspect it was using the toast data from the table t, although I've not tried to investigate further. > > I think this part of the patch needs a bit of a rethink. My first > > thought is to do something similar to what happens for per-column > > MCVs, and set an upper limit on the size of each value that is ever > > considered for inclusion in the stats (c.f. WIDTH_THRESHOLD and > > toowide_cnt in analyse.c). Over-wide values should be excluded early > > on, and it will need to track whether or not any such values were > > excluded, because then it wouldn't be appropriate to treat the stats > > as complete and keep the entire list, without calling > > get_mincount_for_mcv_list(). > > > Which part? Serialization / deserialization? Or how we handle long > values when building the MCV list? > I was thinking (roughly) of something like the following: * When building the values array for the MCV list, strip out rows with values wider than some threshold (probably something like the WIDTH_THRESHOLD = 1024 from analyse.c would be reasonable). * When building the MCV list, if some over-wide values were previously stripped out, always go into the get_mincount_for_mcv_list() block, even if nitems == ngroups (for the same reason a similar thing happens for per-column stats -- if some items were stripped out, we're already saying that not all items will go in the MCV list, and it's not safe to assume that the remaining items are common enough to give accurate estimates). * In the serialisation code, remove the size limit entirely. 
We know that each value is now at most 1024 bytes, and there are at most 10000 items, and at most 8 columns, so the total size is already reasonably well bounded. In the worst case, it might be around 80MB, but in practice, it's always likely to be much much smaller than that. Regards, Dean
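The first step of that plan — excluding over-wide values up front and remembering that some were excluded — might look like this (a rough illustrative Python sketch, with WIDTH_THRESHOLD matching the per-column stats cutoff in analyse.c):

```python
WIDTH_THRESHOLD = 1024  # bytes; same cutoff as per-column stats in analyse.c

def strip_wide_rows(rows):
    """Drop sample rows containing any value wider than the threshold.
    Returns (kept_rows, toowide): toowide records that something was
    excluded, so the MCV list must not be treated as complete and the
    mincount filtering must still be applied to whatever remains."""
    kept = [row for row in rows
            if all(len(value) <= WIDTH_THRESHOLD for value in row)]
    return kept, len(kept) != len(rows)
```

This bounds the serialised size without needing any explicit 1MB cap, and the toowide flag feeds the "stats are incomplete" decision described above.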
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On Sat, 16 Mar 2019 at 23:44, Tomas Vondra wrote: > > > > 16). This regression test fails for me: > > > > @@ -654,11 +654,11 @@ > > -- check change of unrelated column type does not reset the MCV statistics > > ALTER TABLE mcv_lists ALTER COLUMN d TYPE VARCHAR(64); > > SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = > > 1 AND b = ''1'''); > > estimated | actual > > ---+ > > -50 | 50 > > +11 | 50 > > (1 row) > > > > Maybe that's platform-dependent, given what you said about > > reltuples/relpages being reset. An easy workaround for this would be > > to modify this test (and perhaps the one that follows) to just query > > pg_statistic_ext to see if the MCV statistics have been reset. > > > > Ah, sorry for not explaining this bit - the failure is expected, due to > the reset of relpages/reltuples I mentioned. We do keep the extended > stats, but the relsize estimate changes a bit. It surprised me a bit, > and this test made the behavior apparent. The last patchset included a > piece that changes that - if we decide not to change this, I think we > can simply accept the actual output. > I don't think changing the way reltuples is reset ought to be within the scope of this patch. There might be good reasons for it being the way it is. Perhaps open a discussion on a separate thread? As far as this test goes, how about just doing this: -- check change of unrelated column type does not reset the MCV statistics ALTER TABLE mcv_lists ALTER COLUMN d TYPE VARCHAR(64); SELECT stxmcv IS NOT NULL AS has_mcv FROM pg_statistic_ext WHERE stxrelid = 'mcv_lists'::regclass; -- check change of column type resets the MCV statistics ALTER TABLE mcv_lists ALTER COLUMN c TYPE numeric; SELECT stxmcv IS NOT NULL AS has_mcv FROM pg_statistic_ext WHERE stxrelid = 'mcv_lists'::regclass; Regards, Dean
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On Sun, 10 Mar 2019 at 13:09, Dean Rasheed wrote: > Here are some more comments:

One more thing --- the comment for statext_clauselist_selectivity() says:

 * So (simple_selectivity - base_selectivity) may be seen as a correction for
 * the part not covered by the MCV list.

That's not quite right. It should really say that (simple_selectivity - base_selectivity) is an estimate for the part not covered by the MCV list, or that (mcv_selectivity - base_selectivity) is a correction for the part covered by the MCV list. Those 2 statements are actually equivalent, and different from what you wrote. Perhaps the easiest way to see it is to work through a simple example:

Suppose you had the following clauses:

  a = 1 AND b >= 0 AND b <= 10

The per-column stats might be expected to give reasonable independent estimates for the following 2 things:

  P(a = 1)
  P(b >= 0 AND b <= 10)  -- in general, greater than P(b >= 0) * P(b <= 10)

but the overall estimate produced by clauselist_selectivity_simple() would then just be the product of those 2 things:

  simple_sel = P(a = 1) * P(b >= 0 AND b <= 10)

which might not be so good if the columns were correlated. Now suppose you had MCV stats, which included MCV items for the following specific values:

  (a=1,b=1), (a=1,b=2), (a=1,b=3)

but no other relevant MCV entries. (There might be lots of other MCV items that don't match the original clauses, but they're irrelevant for this discussion.) That would mean that we could get reasonable estimates for the following 2 quantities:

  mcv_sel = P(a = 1 AND b IN (1,2,3))
          = P(a = 1 AND b = 1) + P(a = 1 AND b = 2) + P(a = 1 AND b = 3)

  mcv_basesel = base_freq(a = 1 AND b IN (1,2,3))
              = P(a = 1) * (P(b = 1) + P(b = 2) + P(b = 3))

So how is that useful? Well, returning to the quantity that we're actually trying to compute, it can be split into MCV and non-MCV parts, and since they're mutually exclusive possibilities, their probabilities just add up.
Thus we can write:

  P(a = 1 AND b >= 0 AND b <= 10)
    = P(a = 1 AND b IN (1,2,3))                             -- MCV part
    + P(a = 1 AND b >= 0 AND b <= 10 AND b NOT IN (1,2,3))  -- non-MCV part
    = mcv_sel + other_sel

So the first term is easy -- it's just mcv_sel, from above. The second term is trickier though, since we have no information about the correlation between a and b in the non-MCV region. Just about the best we can do is assume that they're independent, which gives:

  other_sel = P(a = 1 AND b >= 0 AND b <= 10 AND b NOT IN (1,2,3))
           ~= P(a = 1) * P(b >= 0 AND b <= 10 AND b NOT IN (1,2,3))

and that can now be written in terms of things that we know:

  other_sel ~= P(a = 1) * P(b >= 0 AND b <= 10 AND b NOT IN (1,2,3))
             = P(a = 1) * P(b >= 0 AND b <= 10)
             - P(a = 1) * P(b IN (1,2,3))  -- mutually exclusive possibilities
             = simple_sel - mcv_basesel

So, as I said above, (simple_selectivity - base_selectivity) is an estimate for the part not covered by the MCV list. Another way to look at it is to split the original per-column estimate up into MCV and non-MCV parts, and correct the MCV part using the MCV stats:

  simple_sel = P(a = 1) * P(b >= 0 AND b <= 10)
             = P(a = 1) * P(b IN (1,2,3))
             + P(a = 1) * P(b >= 0 AND b <= 10 AND b NOT IN (1,2,3))

The first term is just mcv_basesel, so we can define other_sel to be the other term, giving:

  simple_sel = mcv_basesel  -- MCV part
             + other_sel    -- non-MCV part

Clearly mcv_basesel isn't the best estimate for the MCV part, and it should really be mcv_sel, so we can improve upon simple_sel by applying a correction of (mcv_sel - mcv_basesel) to it:

  better estimate = simple_sel + (mcv_sel - mcv_basesel)
                  = mcv_sel + other_sel  (where other_sel = simple_sel - mcv_basesel)

Of course, that's totally equivalent, but looking at it this way (mcv_selectivity - base_selectivity) can be seen as a correction for the part covered by the MCV list.
All of that generalises to arbitrary clauses, because the matching items in the MCV list are independent possibilities that sum up, and the MCV and non-MCV parts are mutually exclusive. That's also why the basesel calculation in mcv_clauselist_selectivity() must only include matching MCV items, and the following XXX comment is wrong:

+    for (i = 0; i < mcv->nitems; i++)
+    {
+        *totalsel += mcv->items[i]->frequency;
+
+        if (matches[i] != STATS_MATCH_NONE)
+        {
+            /* XXX Shouldn't the basesel be outside the if condition? */
+            *basesel += mcv->items[i]->base_frequency;
+            s += mcv->items[i]->frequency;
+        }
+    }

So I believe that that code is correct, as written. Regards, Dean
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On Sun, 10 Mar 2019 at 17:36, Tomas Vondra wrote: > On 3/10/19 2:09 PM, Dean Rasheed wrote: > > 14). The attnums Bitmapset passed to > > statext_is_compatible_clause_internal() is an input/output argument > > that it updates. That should probably be documented. When it calls > > itself recursively for AND/OR/NOT clauses, it could just pass the > > original Bitmapset through to be updated (rather than creating a new > > one and merging it), as it does for other types of clause. > > I don't think it's really possible, because the AND/OR/NOT clause is > considered compatible only when all the pieces are compatible. So we > can't tweak the original bitmapset directly in case the incompatible > clause is not the very first one. > In the case where the overall clause is incompatible, you don't actually care about the attnums returned. Right now it will return an empty set (NULL). With this change it would return all the attnums encountered before the incompatible piece, but that wouldn't matter. In fact, you could easily preserve the current behaviour just by having the outer statext_is_compatible_clause() function set attnums back to NULL if the result is false. Regards, Dean
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On Sun, 10 Mar 2019 at 22:28, David Rowley wrote: > > On Mon, 11 Mar 2019 at 06:36, Tomas Vondra > wrote: > > > > On 3/9/19 7:33 PM, Dean Rasheed wrote: > > > I wonder if it's possible to write smaller, more targeted tests. > > > Currently "stats_ext" is by far the slowest test in its group, and I'm > > > not sure that some of those tests add much. It ought to be possible to > > > write a function that calls EXPLAIN and returns a query's row > > > estimate, and then you could write tests to confirm the effect of the > > > new stats by verifying the row estimates change as expected. > > > > Sure, if we can write more targeted tests, that would be good. But it's > > not quite clear to me how wrapping EXPLAIN in a function makes those > > tests any faster? > > I've not looked at the tests in question, but if they're checking that an > inferior plan is used when no extended stats exist, then maybe that's > why they're slow. > > I think Dean might mean to create a function similar to > explain_parallel_append() in partition_prune.sql then write tests that > check the row estimate with EXPLAIN (COSTS ON) but strip out the other > costing stuff instead of validating that the poor plan was chosen. > Yeah, that's the sort of thing I was thinking of. I think it might be possible to write simpler and faster tests by inserting far fewer rows and relying on ANALYSE having sampled everything, so the row estimates should be predictable. It may be the case that, with just a handful of rows, the extended stats don't affect the plan, but you'd still see a difference in the row estimates, and that could be a sufficient test I think. > > On 3/10/19 2:09 PM, Dean Rasheed wrote: > > > 12). bms_member_index() should surely be in bitmapset.c. It could be > > > more efficient by just traversing the bitmap words and making use of > > > bmw_popcount(). Also, its second argument should be of type 'int' for > > > consistency with other bms_* functions.
> > > > Yes, moving to bitmapset.c definitely makes sense. I don't see how it > > could use bms_popcount() though. > > I think it could be done by first checking if the parameter is a > member of the set, and then if so, count all the bits that come on and > before that member. You can use bmw_popcount() for whole words before > the specific member's word then just bitwise-and a bit mask of a > bitmapword that has all bits set for all bits on and before your > parameter's BITNUM(), and add the bmw_popcount of the final word > bitwise-anding the mask. bms_add_range() has some masking code you > could copy. Yep, that's what I was imagining. Except I think that to get a 0-based index result you'd want the mask to have all bits set for bits *before* the parameter's BITNUM(), rather than on and before. So I think the mask would simply be ((bitmapword) 1 << bitnum) - 1 Regards, Dean
Re: [Patch] Log10 and hyperbolic functions for SQL:2016 compliance
On Sun, 3 Feb 2019 at 15:12, Tom Lane wrote: > > Andrew Gierth writes: > > The spec doesn't require the inverse functions (asinh, acosh, atanh), > > but surely there is no principled reason to omit them? > > +1 --- AFAICS, the C library has offered all six since C89. > +1 for including the inverse functions. However, it looks to me like the inverse functions are C99-specific, so they might not be available on all supported platforms. If they're not, we may need to provide our own implementations. I did a bit of research and had a play. It looks like it might not be too hard to provide our own implementations, but it does take a bit of care to avoid rounding and overflow errors. Attached are some standalone C programs where I tested various algorithms. A decent approach seems to be to either use log1p() (which is itself C99-specific, hence that's the first thing I played with) or to use a single round of Newton-Raphson to correct rounding errors, in a similar way to how we implement cbrt() on platforms that don't have that. Of course that may all be moot -- those functions may in fact be available everywhere we care about, but it was interesting to play around with them anyway. Regards, Dean

#include <math.h>
#include <stdio.h>
#include <float.h>

/*
 * A commonly used algorithm, due to Kahan (citation needed).
 *
 * Max error: 1 ulp.
 * Avg error: 0.158 ulp.
 *
 * It's not obvious, but this does appear to be monotonic at the cutover point
 * (at least on my machine). Can that be relied upon on all platforms?
 *
 * -5.5511151231257851673e-17 -> -5.5511151231257851673e-17 (u != 1 branch)
 * -5.5511151231257839347e-17 -> -5.5511151231257839347e-17 (u != 1 branch)
 * -5.5511151231257827021e-17 -> -5.5511151231257827021e-17 (u == 1 branch)
 * -5.5511151231257820858e-17 -> -5.5511151231257820858e-17 (u == 1 branch)
 *
 * 1.1102230246251564172e-16 -> 1.1102230246251564172e-16 (u == 1 branch)
 * 1.1102230246251565404e-16 -> 1.1102230246251565404e-16 (u == 1 branch)
 * 1.1102230246251567869e-16 -> 1.1102230246251565404e-16 (u != 1 branch)
 * 1.1102230246251570335e-16 -> 1.1102230246251567869e-16 (u != 1 branch)
 *
 * However, it doesn't appear to be monotonic at various other points.
 */
double kahan_log1p(double x)
{
    volatile double u = 1+x;

    return u == 1 ? x : x * (log(u) / (u-1));
}

/*
 * Algorithm used in the GNU Scientific Library.
 *
 * Max error: 1 ulp.
 * Avg error: 0.086 ulp.
 *
 * Where does this formula come from? Licensing?
 * Doesn't return -0 when x is -0, but that could be fixed up.
 * Looks to be monotonic everywhere.
 */
double gsl_log1p(double x)
{
    volatile double y, z;

    y = 1 + x;
    z = y - 1;

    return log(y) - (z-x)/y;
}

/*
 * Alternative algorithm using one round of Newton-Raphson.
 *
 * Max error: 1 ulp.
 * Avg error: 0.094 ulp.
 *
 * Works well for all inputs.
 * Looks to be monotonic everywhere.
 */
double nr_log1p(double x)
{
    double y, e;

    if (x == 0)
        return x;

    y = log(1+x);
    e = exp(y);

    return y - (e-1-x)/e;
}

/* === TESTS === */

int num_tests = 0;
double max_kahan_err = 0.0;
double max_gsl_err = 0.0;
double max_nr_err = 0.0;
double total_kahan_err = 0.0;
double total_gsl_err = 0.0;
double total_nr_err = 0.0;

double err(double x, double y)
{
    if (x < y)
    {
        double diff = y - x;
        double ulp = nextafter(x, y) - x;
        return diff / ulp;
    }
    if (x > y)
    {
        double diff = x - y;
        double ulp = nextafter(y, x) - y;
        return diff / ulp;
    }
    return 0.0;
}

void test_log1p(double x)
{
    double y = log1p(x);
    double kahan_y = kahan_log1p(x);
    double gsl_y = gsl_log1p(x);
    double nr_y = nr_log1p(x);
    double kahan_err = err(y, kahan_y);
    double gsl_err = err(y, gsl_y);
    double nr_err = err(y, nr_y);
    double prev_x = nextafter(x, -DBL_MAX);
    double next_x = nextafter(x, DBL_MAX);

#define monotonic(fn) \
    ( prev_x == -1 || (fn)(prev_x) <= (fn)(x) ) && \
    ( next_x == DBL_MAX || (fn)(next_x) >= (fn)(x) ) ? \
    "Monotonic" : "Not monotonic"

    printf("x = %.20g\n", x);
    printf("Standard result: %.20g %s\n", y, monotonic(log1p));
    printf("Kahan log1p():   %.20g, err=%f %s\n", kahan_y, kahan_err, monotonic(kahan_log1p));
    printf("GSL log1p():     %.20g, err=%f %s\n", gsl_y, gsl_err, monotonic(gsl_log1p));
    printf("NR log1p():      %.20g, err=%f %s\n", nr_y, nr_err, monotonic(nr_log1p));

    if (kahan_err > max_kahan_err) max_kahan_err = kahan_err;
    if (gsl_err > max_gsl_err) max_gsl_err = gsl_err;
    if (nr_err > max_nr_err) max_nr_err = nr_err;
    total_kahan_err += kahan_err;
    total_gsl_err += gsl_err;
    total_nr_err += nr_err;
    num_tests++;
}

int main()
{
    test_log1p(nextafter(-1, 0));
    test_log1p(3.141e-16-1);
    test_log1p(3.141e-15-1);
    test_log1p(3.141e-14-1);
    test_log1p(3.141e-13-1);
    test_log1p(3.141e-12-1);
    test_log1p(3.141e-11-1);
    test_log1p(3.141e-10-1);
    test_log1p(3.141e-9-1);
    test_log1p(3.141e-8-1);
    test_log1p(3.141e-7-1);
    test_log1p(3.141e-6-1);
    test_log1p(3.141e-5-1);
    test_log1p(3.141e-4-1);
    test_log1p(3.141e-3-1);
    test_log1p(3.141e-2-1);
    test_log1p(-6.859e-1);
    /* ... */
    return 0;
}
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On Thu, 28 Feb 2019 at 19:56, Tomas Vondra wrote: > Attached is an updated version of this patch series.

Here are some random review comments. I'll add more later, but I'm out of energy for today.

1). src/test/regress/expected/type_sanity.out has bit-rotted.

2). Duplicate OIDs (3425).

3). It looks a bit odd that clauselist_selectivity() calls statext_clauselist_selectivity(), which does MCV stats and will do histograms, but it doesn't do dependencies, so clauselist_selectivity() has to then separately call dependencies_clauselist_selectivity(). It would seem neater if statext_clauselist_selectivity() took care of calling dependencies_clauselist_selectivity(), since dependencies are just another kind of extended stats.

4). There are no tests for pg_mcv_list_items(). Given a table with a small enough amount of data, so that it's all sampled, it ought to be possible to get predictable MCV stats.

5). It's not obvious what some of the new test cases in the "stats_ext" tests are intended to show. For example, the first test creates a table with 5000 rows and a couple of indexes, does a couple of queries, builds some MCV stats, and then repeats the queries, but the results seem to be the same with and without the stats. I wonder if it's possible to write smaller, more targeted tests. Currently "stats_ext" is by far the slowest test in its group, and I'm not sure that some of those tests add much. It ought to be possible to write a function that calls EXPLAIN and returns a query's row estimate, and then you could write tests to confirm the effect of the new stats by verifying the row estimates change as expected.

6). This enum isn't needed for MCVs:

/*
 * Degree of how much MCV item matches a clause.
 * This is then considered when computing the selectivity.
 */
#define STATS_MATCH_NONE     0  /* no match at all */
#define STATS_MATCH_PARTIAL  1  /* partial match */
#define STATS_MATCH_FULL     2  /* full match */

STATS_MATCH_PARTIAL is never used for MCVs, so you may as well just use booleans instead of this enum. If those are needed for histograms, they can probably be made local to histogram.c.

7). estimate_num_groups_simple() isn't needed in this patch.

8). In README.mcv, s/clauselist_mv_selectivity_mcvlist/mcv_clauselist_selectivity/.

9). In the list of supported clause types that follows, (e) is the same as (c), but with a more general description.

10). It looks like most of the subsequent description of the algorithm is out of date and needs rewriting. All the stuff about full matches and the use of ndistinct is now obsolete.

Regards, Dean
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On Sat, 9 Mar 2019 at 18:33, Dean Rasheed wrote: > > On Thu, 28 Feb 2019 at 19:56, Tomas Vondra > wrote: > > Attached is an updated version of this patch series. > > Here are some random review comments. I'll add more later, but I'm out > of energy for today. >

Here are some more comments:

11). In dependency_degree():

-    /* sort the items so that we can detect the groups */
-    qsort_arg((void *) items, numrows, sizeof(SortItem),
-              multi_sort_compare, mss);
+    /*
+     * build an array of SortItem(s) sorted using the multi-sort support
+     *
+     * XXX This relies on all stats entries pointing to the same tuple
+     * descriptor. Not sure if that might not be the case.
+     */
+    items = build_sorted_items(numrows, rows, stats[0]->tupDesc,
+                               mss, k, attnums_dep);

That XXX comment puzzled me for a while. Actually it's OK though, unless/until we try to support stats across multiple relations, which will require a much larger refactoring of this code. For now though, the stats entries all point to the same tuple descriptor from the onerel passed to BuildRelationExtStatistics(), so it's OK to just use the first tuple descriptor in this way. The comment should be updated to explain that.

12). bms_member_index() should surely be in bitmapset.c. It could be more efficient by just traversing the bitmap words and making use of bmw_popcount(). Also, its second argument should be of type 'int' for consistency with other bms_* functions.

13). estimate_ndistinct() has been moved from mvdistinct.c to extended_stats.c and changed from static to extern, but it is only called from mvdistinct.c, so that change is unnecessary (at least as far as this patch is concerned).

14). The attnums Bitmapset passed to statext_is_compatible_clause_internal() is an input/output argument that it updates. That should probably be documented.
When it calls itself recursively for AND/OR/NOT clauses, it could just pass the original Bitmapset through to be updated (rather than creating a new one and merging it), as it does for other types of clause. On the other hand, the outer function statext_is_compatible_clause() does need to return a new bitmap, which may or may not be used by its caller, so it would be cleaner to make that a strictly "out" parameter and initialise it to NULL in that function, rather than in its caller.

15). As I said yesterday, I don't think that there is a clean separation of concerns between the functions clauselist_selectivity(), statext_clauselist_selectivity(), dependencies_clauselist_selectivity() and mcv_clauselist_selectivity(). I think things could be re-arranged as follows:

statext_clauselist_selectivity() - as the name suggests - should take care of *all* extended stats estimation, not just MCVs and histograms. So make it a fairly small function, calling mcv_clauselist_selectivity() and dependencies_clauselist_selectivity(), and histograms when that gets added.

Most of the current code in statext_clauselist_selectivity() is really MCV-specific, so move that to mcv_clauselist_selectivity(). Amongst other things, that would move the call to choose_best_statistics() to mcv_clauselist_selectivity() (just as dependencies_clauselist_selectivity() calls choose_best_statistics() to get the best dependencies statistics). Then, when histograms are added later, you won't have the problem pointed out before where it can't apply both MCV and histogram stats if they're on different STATISTICS objects.

Most of the comments for statext_clauselist_selectivity() are also MCV-related. Those too would move to mcv_clauselist_selectivity().

Regards, Dean
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On Wed, 6 Feb 2019 at 23:44, Tomas Vondra wrote: > > On 2/6/19 10:59 PM, David Rowley wrote: > > On Thu, 7 Feb 2019 at 03:16, Alvaro Herrera > > wrote: > >> I wonder what should we be doing with this series -- concretely, should > >> the effort concentrate on one of the two patches, and leave the other > >> for pg13, to increase the chances of the first one being in pg12? I > >> would favor that approach, since it's pretty late in the cycle by now > >> and it seems dubious that both will be ready. > > > > I mostly have been reviewing the MCV patch with the thoughts that one > > is better than none in PG12. I don't see any particular reason that > > we need both in the one release. > > > > I agree with that, although most of the complexity likely lies in > integrating the stats into the selectivity estimation - if we get that > right for the MCV patch, adding histogram seems comparably simpler. > > But yeah, let's focus on the MCV part. > Agreed. I think the overall approach of the MCV patch is sound and it's getting closer to being committable. David's review comments were excellent. I'll try to review it as well when you post your next update. I have some more fundamental doubts about the histogram patch, to do with the way it integrates with selectivity estimation, and some vague half-formed ideas about how that could be improved, but nothing clear enough that I can express right now. So yes, let's focus on the MCV patch for now. Regards, Dean
Multivariate MCV lists -- pg_mcv_list_items() seems to be broken
I just noticed the following:

CREATE TABLE foo (a int, b int);
INSERT INTO foo SELECT x/10, x/100 FROM generate_series(1, 100) x;
CREATE STATISTICS foo_s ON a,b FROM foo;
ANALYSE foo;
SELECT pg_mcv_list_items(stxmcv) from pg_statistic_ext WHERE stxname = 'foo_s';

which fails with

ERROR: cache lookup failed for type 0

That definitely used to work, so I'm guessing it got broken by the recent reworking of the serialisation code, but I've not looked into it. There should probably be regression test coverage of that function. Regards, Dean
Re: pgsql: Add support for hyperbolic functions, as well as log10().
On Wed, 13 Mar 2019, 21:56 Tom Lane, wrote: > > Of these, probably the least bad is #3, even though it might require > a few rounds of experimentation to find the best extra_float_digits > setting to use. I'll go try it without any roundoff, just to see > what the raw results look like ... > Yeah, that seems like a reasonable thing to try. I'm amazed that jacana's asinh() returned -0 for an input of +0. I'm not aware of any implementation that does that. I'd be quite interested to know what it returned for an input like 1e-20. If that returned any variety of zero, I'd say that it's worse than useless. Another interesting test case would be whether or not it satisfies asinh(-x) = -asinh(x) for a variety of different values of x, because that's something that commonly breaks down badly with naive implementations. It's not unreasonable to expect these functions to be accurate to within the last 1 or 2 digits, so testing with extra_float_digits or whatever seems reasonable, but I think a wider variety of test inputs is required. I also wonder if we should be doing what we do for the regular trig functions and explicitly handle special cases like Inf and NaN to ensure POSIX compatibility on all platforms. Regards, Dean >
Re: Choosing values for multivariate MCV lists
On Tue, 18 Jun 2019 at 21:59, Tomas Vondra wrote: > > The current implementation of multi-column MCV lists (added in this > cycle) uses a fairly simple algorithm to pick combinations to include in > the MCV list. We just compute a minimum number of occurences, and then > include all entries sampled more often. See get_mincount_for_mcv_list(). > > [snip] > > This however means that by choosing the MCV entries solely based on the > number of occurrences in the sample, we end up with MCV lists where vast > majority of entries has NULL street name. > > That's why we got such poor estimate in the example query, despite the > fact that the city/street combination is the most frequent in the data > set (with non-NULL street name). > I think the fact that they're NULL is a bit of a red herring because we're treating NULL just like any other value. The same thing would happen if there were some other very common non-NULL value that dominated the dataset. > The other weird thing is that frequency of NULL street names is fairly > uniform in the whole data set. In total about 50% addresses match that, > and for individual cities it's generally between 25% and 100%, so the > estimate is less than 2x off in those cases. > > But for addresses with non-NULL street names, the situation is very > different. Some street names are unique to a single city, etc. > > Overall, this means we end up with MCV list with entries representing > the mostly-uniform part of the data set, instead of prefering the > entries that are truly skewed. > > So I'm wondering if we could/should rethink the algorithm, so look at > the frequency and base_frequency, and maybe pick entries based on their > ratio, or something like that. > Hmm, interesting. I think I would like to see a more rigorous justification for changing the algorithm deciding which values to keep. 
If I've understood correctly, I think the problem is this: The mincount calculation is a good way of identifying MCV candidates to keep, because it ensures that we don't keep values that don't appear sufficiently often to produce accurate estimates, and ideally we'd keep everything with count >= mincount. However, in the case where there are more than stats_target items with count >= mincount, simply ordering by count and keeping the most commonly seen values isn't necessarily the best strategy in the case of multivariate statistics.

To identify what the best strategy might be, I think you need to examine the errors that would occur as a result of *not* keeping a value in the multivariate MCV list. Given a value that appears with count >= mincount, N*freq ought to be a reasonable estimate for the actual number of occurrences of that value in the table, and N*base_freq ought to be a reasonable estimate for the univariate-stats-based estimate that it would be given if it weren't kept in the multivariate MCV list. So the absolute error resulting from not keeping that value would be

  N * Abs(freq - base_freq)

But then I think we ought to take into account how often we're likely to get that error. If we're simply picking values at random, the likelihood of getting that value is just its frequency, so the average absolute error would be

  Sum( N * freq[i] * Abs(freq[i] - base_freq[i]) )

which suggests that, if we wanted to reduce the average absolute error of the estimates, we should order by freq*Abs(freq-base_freq) and keep the top n in the MCV list. On the other hand, if we wanted to reduce the average *relative* error of the estimates, we might instead order by Abs(freq-base_freq).

> For example, we might sort the entries by
>
>   abs(freq - base_freq) / freq
>

I'm not sure it's easy to justify ordering by Abs(freq-base_freq)/freq though, because that would seem likely to put too much weight on the least commonly occurring values.
> Of course, this is a challenging problem, for a couple of reasons. > > Firstly, picking simply the most frequent groups is very simple and it > gives us additional information about the largest group (which may be > useful elsewhere, e.g. the incremental sort patch). > Yes, you would have to keep in mind that changing the algorithm would mean that the MCV list no longer represented all the most common values. For example, it would no longer be valid to assume that no value appeared more often than the first value in the MCV list. I'm not sure that we currently do that though. > Secondly, if the correlated combinations are less frequent, how reliably > can we even estimate the frequency from a sample? The combination in the > example query was ~0.02% of the data, so how likely it's to be sampled? > I think that's OK as long as we keep the mincount filter in the new algorithm. I think some experimentation is definitely worthwhile here, but it looks plausible that a decent approach might be: 1). Discard all values with count < mincount. 2). Order by freq*Abs(freq-base_freq) (or possibly just Abs(freq-base_freq)) and keep the top n,
Re: Multivariate MCV list vs. statistics target
On Thu, 20 Jun 2019 at 23:12, Tomas Vondra wrote: > > On Thu, Jun 20, 2019 at 08:08:44AM +0100, Dean Rasheed wrote: > >On Tue, 18 Jun 2019 at 22:34, Tomas Vondra > >wrote: > >> > >> So I'm thinking we should allow tweaking the statistics for extended > >> stats, and serialize it in the pg_statistic_ext catalog. Any opinions > >> why that would be a bad idea? > > > >Seems reasonable to me. This might not be the only option we'll ever > >want to add though, so perhaps a "stxoptions text[]" column along the > >lines of a relation's reloptions would be the way to go. > > I don't know - I kinda dislike the idea of stashing stuff like this into > text[] arrays unless there's a clear need for such flexibility (i.e. > vision to have more such options). Which I'm not sure is the case here. > And we kinda have a precedent in pg_attribute.attstattarget, so I'd use > the same approach here. > Hmm, maybe. I can certainly understand your dislike of using text[]. I'm not sure that we can confidently say that multivariate statistics won't ever need additional tuning knobs, but I have no idea at the moment what they might be, and nothing else has come up so far in all the time spent considering MCV lists and histograms, so maybe this is OK. Regards, Dean
Re: Choosing values for multivariate MCV lists
On Thu, 20 Jun 2019 at 23:35, Tomas Vondra wrote: > > On Thu, Jun 20, 2019 at 06:55:41AM +0100, Dean Rasheed wrote: > > >I'm not sure it's easy to justify ordering by Abs(freq-base_freq)/freq > >though, because that would seem likely to put too much weight on the > >least commonly occurring values. > > But would that be an issue, or a good thing? I mean, as long as the item > is above mincount, we take the counts as reliable. As I explained, my > motivation for proposing that was that both > >... (cost=... rows=1 ...) (actual=... rows=101 ...) > > and > >... (cost=... rows=100 ...) (actual=... rows=200 ...) > > have exactly the same Abs(freq - base_freq), but I think we both agree > that the first misestimate is much more dangerous, because it's off by six > orders of magnitude. > Hmm, that's a good example. That definitely suggests that we should be trying to minimise the relative error, but also perhaps that what we should be looking at is actually just the ratio freq / base_freq, rather than their difference. Regards, Dean
Re: Multivariate MCV stats can leak data to unprivileged users
On Mon, 13 May 2019 at 23:36, Tomas Vondra wrote: > > On Fri, May 10, 2019 at 10:19:44AM +0100, Dean Rasheed wrote: > >While working on 1aebfbea83c, I noticed that the new multivariate MCV > >stats feature suffers from the same problem, and also the original > >problems that were fixed in e2d4ef8de8 and earlier --- namely that a > >user can see values in the MCV lists that they shouldn't see (values > >from tables that they don't have privileges on). > > > >I think there are 2 separate issues here: > > > >1). The table pg_statistic_ext is accessible to anyone, so any user > >can see the MCV lists of any table. I think we should give this the > >same treatment as pg_statistic, and hide it behind a security barrier > >view, revoking public access from the table. > > > >2). The multivariate MCV stats planner code can be made to invoke > >user-defined operators, so a user can create a leaky operator and use > >it to reveal data values from the MCV lists even if they have no > >permissions on the table. > > > >Attached is a draft patch to fix (2), which hooks into > >statext_is_compatible_clause(). > > > > I think that patch is good. > I realised that we forgot to push this second part, so I've just done so. Regards, Dean
Re: Choosing values for multivariate MCV lists
On Sat, 22 Jun 2019 at 15:10, Tomas Vondra wrote: > One annoying thing I noticed is that the base_frequency tends to end up > being 0, most likely due to getting too small. It's a bit strange, though, > because with statistic target set to 10k the smallest frequency for a > single column is 1/3e6, so for 2 columns it'd be ~1/9e12 (which I think is > something the float8 can represent). > Yeah, it should be impossible for the base frequency to underflow to 0. However, it looks like the problem is with mcv_list_items()'s use of %f to convert to text, which is pretty ugly. Regards, Dean
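The effect is easy to reproduce outside the backend. Python's %-formatting follows the same printf-style conversion rules as the C sprintf() call in question, so this sketch (using the ~1/9e12 base frequency from the quoted message) shows how the value is lost:

```python
# Sketch of the formatting problem: "%f" prints only six digits after the
# decimal point, so a tiny-but-nonzero base frequency renders as zero even
# though float8 represents it without difficulty.

base_freq = 1.0 / 9e12            # ~1.1e-13, well within float8 range

print("%f" % base_freq)           # -> 0.000000  (all precision lost)
print("%.15g" % base_freq)        # scientific notation, value preserved
```

So the base frequency never underflowed; it was merely being formatted to zero on output.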
Re: Choosing values for multivariate MCV lists
On Mon, 24 Jun 2019 at 00:42, Tomas Vondra wrote:
>
> On Sun, Jun 23, 2019 at 10:23:19PM +0200, Tomas Vondra wrote:
> >On Sun, Jun 23, 2019 at 08:48:26PM +0100, Dean Rasheed wrote:
> >>On Sat, 22 Jun 2019 at 15:10, Tomas Vondra wrote:
> >>>One annoying thing I noticed is that the base_frequency tends to end up
> >>>being 0, most likely due to getting too small. It's a bit strange, though,
> >>>because with statistics target set to 10k the smallest frequency for a
> >>>single column is 1/3e6, so for 2 columns it'd be ~1/9e12 (which I think is
> >>>something the float8 can represent).
> >>
> >>Yeah, it should be impossible for the base frequency to underflow to
> >>0. However, it looks like the problem is with mcv_list_items()'s use
> >>of %f to convert to text, which is pretty ugly.
> >
> >Yeah, I realized that too, eventually. One way to fix that would be
> >adding %.15f to the sprintf() call, but that just adds ugliness. It's
> >probably time to rewrite the function to build the tuple from datums,
> >instead of relying on BuildTupleFromCStrings.
>
> OK, attached is a patch doing this. It's pretty simple, and it does
> resolve the issue with frequency precision.
>
> There's one issue with the signature, though - currently the function
> returns null flags as bool array, but values are returned as simple
> text value (formatted in array-like way, but still just a text).
>
> In the attached patch I've reworked both to proper arrays, but obviously
> that'd require a CATVERSION bump - and there's not much appetite for that
> past beta2, I suppose. So I'll just undo this bit.
>

Hmm, I didn't spot that the old code was using a single text value rather than a text array. That's clearly broken, especially since it wasn't even necessarily constructing a valid textual representation of an array (e.g., if an individual value's textual representation included the array markers "{" or "}").
IMO fixing this to return a text array is worth doing, even though it means a catversion bump. Regards, Dean
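To illustrate why hand-assembling an array-like text value is broken, here is a small sketch (Python, with invented helper names and example values; the real code built the string in C while assembling the output tuple):

```python
# A naive join produces something that is not a valid array literal when a
# value contains the array delimiters (or commas, quotes, backslashes);
# elements need quoting/escaping, which the naive version skips.

def naive_array_text(values):
    return "{" + ",".join(values) + "}"

def quoted_array_text(values):
    # Minimal escaping sketch: double-quote each element, escaping \ and ".
    quoted = ['"' + v.replace("\\", "\\\\").replace('"', '\\"') + '"'
              for v in values]
    return "{" + ",".join(quoted) + "}"

vals = ["abc", "has,comma", "has}brace"]
print(naive_array_text(vals))    # ambiguous/invalid as an array literal
print(quoted_array_text(vals))   # parses cleanly as a 3-element text array
```

Returning a genuine text array instead sidesteps this whole class of quoting bugs.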
Re: Multivariate MCV stats can leak data to unprivileged users
On Thu, 6 Jun 2019 at 21:33, Tomas Vondra wrote:
>
> Hi,
>
> Attached are three patches tweaking the stats - two were already posted
> in this thread, the third one is just updating docs.
>
> 1) 0001 - split pg_statistic_ext into definition + data
>
> This is pretty much the patch Dean posted some time ago, rebased to
> current master (fixing just minor pgindent bitrot).
>
> 2) 0002 - update sgml docs to reflect changes from 0001
>
> 3) 0003 - define pg_stats_ext view, similar to pg_stats
>

Seems reasonable on a quick read-through, except I spotted a bug in the view (my fault) -- the statistics_owner column should come from s.stxowner rather than c.relowner.

> The question is whether we want to also redesign pg_statistic_ext_data
> per Tom's proposal (more about that later), but I think we can treat
> that as an additional step on top of 0001. So I propose we get those
> changes committed, and then perhaps also switch the data table to the
> EAV model.
>
> Barring objections, I'll do that early next week, after cleaning up
> those patches a bit more.
>
> One thing I think we should fix is naming of the attributes in the 0001
> patch. At the moment both catalogs use "stx" prefix - e.g. "stxkind" is
> in pg_statistic_ext, and "stxmcv" is in pg_statistic_ext_data. We should
> probably switch to "stxd" in the _data catalog. Opinions?
>

Yes, that makes sense. Especially when joining the 2 tables, since it makes it more obvious which table a given column is coming from in a join clause.

> Now, back to the proposal to split the _data catalog rows to EAV form,
> with a new data type replacing the multiple types we have at the moment.
> I've started hacking on it today, but the more I work on it the less
> useful it seems to me.
>
> My understanding is that with that approach we'd replace the _data
> catalog (which currently has one column per statistic type, with a
> separate data type) with 1:M generic rows, with a generic data type.
> That is, we'd replace this
>
>   CREATE TABLE pg_statistic_ext_data (
>       stxoid           OID,
>       stxdependencies  pg_dependencies,
>       stxndistinct     pg_ndistinct,
>       stxmcv           pg_mcv_list,
>       ... histograms ...
>   );
>
> with something like this:
>
>   CREATE TABLE pg_statistic_ext_data (
>       stxoid   OID,
>       stxkind  CHAR,
>       stxdata  pg_statistic_ext_type
>   );
>
> where pg_statistic_ext_type would store all existing statistic types, along
> with a "flag" saying which value it actually stored (essentially a copy
> of the stxkind column, which we however need to look up a statistic of a
> certain type, without having to detoast the statistic itself).
>
> As I mentioned before, I kinda dislike the fact that this obfuscates the
> actual statistic type by hiding it behind the "wrapper" type.
>
> The other thing is that we have to deal with the 1:M relationship every time
> we (re)build the statistics, or when we need to access them. Now, it may
> not be a huge amount of code, but it just seems unnecessary. It would
> make sense if we planned to add a large number of additional statistic
> types, but that seems unlikely - I personally can think of maybe one new
> statistic type, but that's about it.
>
> I'll continue working on it and I'll share the results early next week,
> after playing with it a bit, but I think we should get the existing
> patches committed and then continue discussing this as an additional
> improvement.
>

I wonder ... would it be completely crazy to just use a JSON column to store the extended stats data? It wouldn't be as compact as your representation, but it would allow for future stats kinds without changing the catalog definitions, and it wouldn't obfuscate the stats types. You could keep the 1:1 relationship, and have top-level JSON keys for each stats kind built, and you wouldn't need the pg_mcv_list_items() function because you could just put the MCV data in JSON arrays, which would be much more transparent, and would make the user-accessible view much simpler.
One could also imagine writing regression tests that checked for specific expected MCV values like "stxdata->'mcv'->'frequency'->0". Just a thought. Regards, Dean
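For the sake of discussion, here is a rough sketch of what such a JSON-based stxdata value might look like. To be clear, every key name here ("mcv", "items", "values", "frequency", "base_frequency", "ndistinct") is invented purely for illustration, not an actual or proposed catalog format:

```python
import json

# Hypothetical JSON layout for a stats object with MCV and ndistinct data.
# All key names are invented for illustration only.
stxdata = json.loads("""
{
  "mcv": {
    "items": [
      {"values": ["a", "x"], "frequency": 0.15, "base_frequency": 0.0375},
      {"values": ["b", "y"], "frequency": 0.10, "base_frequency": 0.0200}
    ]
  },
  "ndistinct": {"1, 2": 42}
}
""")

# No dedicated decoding function is needed: the data is directly
# addressable, from SQL's JSON operators just as from Python here.
print(stxdata["mcv"]["items"][0]["frequency"])  # -> 0.15
```

The transparency is the point: each stats kind becomes a plainly inspectable sub-document rather than an opaque binary value with its own output function.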
Re: Multivariate MCV stats can leak data to unprivileged users
On Sat, 18 May 2019 at 16:13, Tom Lane wrote: > > Dean Rasheed writes: > > On the other hand, pg_dump relies on pg_statistic_ext to work out > > which extended statistics objects to dump. If we were to change that > > to use pg_stats_ext, then a user dumping a table with RLS using the > > --enable-row-security flag wouldn't get any extended statistics > > objects, which would be a somewhat surprising result. > > It seems like what we need here is to have a separation between the > *definition* of a stats object (which is what pg_dump needs access > to) and the current actual *data* in it. I'd have expected that > keeping those in separate catalogs would be the thing to do, though > perhaps it's too late for that. > Yeah, with the benefit of hindsight, that would have made sense, but that seems like a pretty big change to be attempting at this stage. Regards, Dean
Re: Multivariate MCV stats can leak data to unprivileged users
On Sun, 19 May 2019 at 15:28, Tom Lane wrote:
> > I wonder ... another way we could potentially do this is
> >
> >   create table pg_statistic_ext_data(
> >       stxoid   oid,    -- OID of owning pg_statistic_ext entry
> >       stxkind  char,   -- what kind of data
> >       stxdata  bytea   -- the data, in some format or other
> >   );
> >
> > The advantage of this way is that we'd not have to rejigger the
> > catalog's rowtype every time we think of a new kind of extended
> > stats. The disadvantage is that manual inspection of the contents
> > of an entry would become much harder, for lack of any convenient
> > output function.
>
> No, wait, scratch that. We could fold the three existing types
> pg_ndistinct, pg_dependencies, pg_mcv_list into one new type, say
> "pg_stats_ext_data", where the actual storage would need to have an
> ID field (so we'd waste a byte or two duplicating the externally
> visible stxkind field inside stxdata). The output function for this
> type is just a switch over the existing code. The big advantage of
> this way compared to the current approach is that adding a new
> ext-stats type requires *zero* work with adding new catalog entries.
> Just add another switch case in pg_stats_ext_data_out() and you're
> done.
>

This feels a little over-engineered to me. Presumably there'd be a compound key on (stxoid, stxkind) and we'd have to scan multiple rows to get all the applicable stats, whereas currently they're all in one row. And then the user-accessible view would probably need separate sub-queries for each stats kind. If the point is just to avoid adding columns to the catalog in future releases, I'm not sure it's worth the added complexity.

We know that we will probably add histogram stats in a future release. I'm not sure how many more kinds we'll end up adding, but it doesn't seem likely to be a huge number. I think we'll add far more columns to other catalog tables as we add new features to each release.

Regards,
Dean
Re: Multivariate MCV stats can leak data to unprivileged users
On Sun, 19 May 2019 at 00:48, Stephen Frost wrote: > > * Tom Lane (t...@sss.pgh.pa.us) wrote: > > Tomas Vondra writes: > > > > > I think we have four options - rework it before beta1, rework it after > > > beta1, rework it in PG13 and leave it as it is now. > > > > Yup, that's about what the options are. I'm just voting against > > "change it in v13". If we're going to change it, then the fewer > > major versions that have the bogus definition the better --- and > > since we're changing that catalog in v12 anyway, users will see > > fewer distinct behaviors if we do this change too. > > > > It's very possibly too late to get it done before beta1, > > unfortunately. But as Andres noted, post-beta1 catversion > > bumps are hardly unusual, so I do not think "rework after > > beta1" is unacceptable. > > Agreed. > Yes, that makes sense. I think we shouldn't risk trying to get this into beta1, but let's try to get it done as soon as possible after that. Actually, it doesn't appear to be as big a change as I had feared. As a starter for ten, here's a patch doing the basic split, moving the extended stats data into a new catalog pg_statistic_ext_data (I'm not particularly wedded to that name, it's just the first name that came to mind). 
With this patch the catalogs look like this:

\d pg_statistic_ext
              Table "pg_catalog.pg_statistic_ext"
    Column    |    Type    | Collation | Nullable | Default
--------------+------------+-----------+----------+---------
 oid          | oid        |           | not null |
 stxrelid     | oid        |           | not null |
 stxname      | name       |           | not null |
 stxnamespace | oid        |           | not null |
 stxowner     | oid        |           | not null |
 stxkeys      | int2vector |           | not null |
 stxkind      | "char"[]   |           | not null |
Indexes:
    "pg_statistic_ext_name_index" UNIQUE, btree (stxname, stxnamespace)
    "pg_statistic_ext_oid_index" UNIQUE, btree (oid)
    "pg_statistic_ext_relid_index" btree (stxrelid)

\d pg_statistic_ext_data
            Table "pg_catalog.pg_statistic_ext_data"
     Column      |      Type       | Collation | Nullable | Default
-----------------+-----------------+-----------+----------+---------
 stxoid          | oid             |           | not null |
 stxndistinct    | pg_ndistinct    | C         |          |
 stxdependencies | pg_dependencies | C         |          |
 stxmcv          | pg_mcv_list     | C         |          |
Indexes:
    "pg_statistic_ext_data_stxoid_index" UNIQUE, btree (stxoid)

I opted to create/remove pg_statistic_ext_data tuples at the same time as the pg_statistic_ext tuples, in CreateStatistics() / RemoveStatisticsById(), so then it's easier to see that they're in a one-to-one relationship, and other code doesn't need to worry about the data tuple not existing. The other option would be to defer inserting the data tuple to ANALYZE.

I couldn't resist moving the code block that declares pg_statistic_ext's indexes in indexing.h to the right place, assuming that file is (mostly) sorted alphabetically by catalog name. This puts the extended stats entries just after the normal stats entries which seems preferable.

This is only a very rough first draft (e.g., no doc updates), but it passes all the regression tests.
Regards,
Dean

diff --git a/src/backend/catalog/Makefile b/src/backend/catalog/Makefile
index f186198..8bece07
--- a/src/backend/catalog/Makefile
+++ b/src/backend/catalog/Makefile
@@ -34,7 +34,7 @@ CATALOG_HEADERS := \
 	pg_attrdef.h pg_constraint.h pg_inherits.h pg_index.h pg_operator.h \
 	pg_opfamily.h pg_opclass.h pg_am.h pg_amop.h pg_amproc.h \
 	pg_language.h pg_largeobject_metadata.h pg_largeobject.h pg_aggregate.h \
-	pg_statistic_ext.h \
+	pg_statistic_ext.h pg_statistic_ext_data.h \
 	pg_statistic.h pg_rewrite.h pg_trigger.h pg_event_trigger.h pg_description.h \
 	pg_cast.h pg_enum.h pg_namespace.h pg_conversion.h pg_depend.h \
 	pg_database.h pg_db_role_setting.h pg_tablespace.h pg_pltemplate.h \
diff --git a/src/backend/commands/statscmds.c b/src/backend/commands/statscmds.c
index a191916..7cfaa96
--- a/src/backend/commands/statscmds.c
+++ b/src/backend/commands/statscmds.c
@@ -23,6 +23,7 @@
 #include "catalog/namespace.h"
 #include "catalog/pg_namespace.h"
 #include "catalog/pg_statistic_ext.h"
+#include "catalog/pg_statistic_ext_data.h"
 #include "commands/comment.h"
 #include "commands/defrem.h"
 #include "miscadmin.h"
@@ -67,8 +68,11 @@ CreateStatistics(CreateStatsStmt *stmt)
 	HeapTuple	htup;
 	Datum		values[Natts_pg_statistic_ext];
 	bool		nulls[Natts_pg_statistic_ext];
+	Datum		datavalues[Natts_pg_statistic_ext_data];
+	bool		datanulls[Natts_pg_statistic_ext_data];
 	int2vector *stxkeys;
 	Relation	statrel;
+	Relation	datarel;
 	Relation	rel = NULL;
 	Oid			relid;
 	ObjectAddress parentobject,
@@ -336,11 +340,6 @@ CreateStatistics(CreateStatsStmt *stmt)
Re: Multivariate MCV stats can leak data to unprivileged users
On Sun, 19 May 2019 at 23:45, Tomas Vondra wrote:
>
> Oh, right. It still has the disadvantage that it obfuscates the actual
> data stored in the pg_stats_ext_data (or whatever would it be called),
> so e.g. functions would have to do additional checks to make sure it
> actually is the right statistic type. For example pg_mcv_list_items()
> could not rely on receiving pg_mcv_list values, as per the signature,
> but would have to check the value.
>

Yes. In fact, since the user-accessible view would want to expose datatypes specific to the stats kinds rather than bytea or cstring values, we would need SQL-callable conversion functions for each kind:

* to_pg_ndistinct(pg_extended_stats_ext_data) returns pg_ndistinct
* to_pg_dependencies(pg_extended_stats_ext_data) returns pg_dependencies
* to_pg_mcv(pg_extended_stats_ext_data) returns pg_mcv
* ...

and each of these would throw an error if it weren't given an extended stats object of the right kind. Then to extract MCV data, you'd have to do pg_mcv_list_items(to_pg_mcv(ext_data)), and presumably there'd be something similar for histograms. IMO, that's not a nice model, compared to just having columns of the right types in the first place.

Also this model presupposes that all future stats kinds are most conveniently represented in a single column, but maybe that won't be the case. It's conceivable that a future stats kind would benefit from splitting its data across multiple columns.

> Of course, I don't expect to have too many such functions, but overall
> this approach with a single type feels a bit too like EAV to my taste.
>

Yes, I think it is an EAV model. I think EAV models do have their place, but I think that's largely where adding new columns is a common operation and involves adding little to no extra code. I don't think either of those is true for extended stats. What we've seen over the last couple of years is that adding each new stats kind is a large undertaking, involving lots of new code.
That alone is going to limit just how many ever get added, and compared to that effort, adding new columns to the catalog is small fry. Regards, Dean
Re: Multivariate MCV stats can leak data to unprivileged users
On Mon, 20 May 2019 at 14:32, Tom Lane wrote: > > Dean Rasheed writes: > > On Sun, 19 May 2019 at 23:45, Tomas Vondra > > wrote: > >> Oh, right. It still has the disadvantage that it obfuscates the actual > >> data stored in the pg_stats_ext_data (or whatever would it be called), > >> so e.g. functions would have to do additional checks to make sure it > >> actually is the right statistic type. For example pg_mcv_list_items() > >> could not rely on receiving pg_mcv_list values, as per the signature, > >> but would have to check the value. > > > Yes. In fact, since the user-accessible view would want to expose > > datatypes specific to the stats kinds rather than bytea or cstring > > values, we would need SQL-callable conversion functions for each kind: > > It seems like people are willfully misunderstanding my suggestion. I'm more than capable of inadvertently misunderstanding, without the need to willfully do so :-) > You'd only need *one* conversion function, which would look at the > embedded ID field and then emit the appropriate text representation. > I don't see a reason why we'd have the separate pg_ndistinct etc. types > any more at all. Hmm, OK. So then would you also make the user-accessible view agnostic about the kinds of stats supported in the same way, returning zero or more rows per STATISTICS object, depending on how many kinds of stats have been built? That would have the advantage of never needing to change the view definition again, as more stats kinds are supported. We'd need to change pg_mcv_list_items() to accept a pg_stats_ext_data value rather than a pg_mcv value, and it would be the user's responsibility to call it if they wanted to see the contents of the MCV list (I was originally thinking that we'd include a call to pg_mcv_list_items() in the view definition, so that it produced friendlier looking output, since the default textual representation of an MCV list is completely opaque, unlike the other stats kinds). 
Actually, I can see another advantage to not including pg_mcv_list_items() in the view definition -- in the future, we may dream up a better version of pg_mcv_list_items(), like say one that produced JSON, and then we'd regret using the current function. > Anyway, it was just a suggestion, and if people don't like it that's > fine. But I don't want it to be rejected on the basis of false > arguments. To be clear, I'm not intentionally rejecting your idea. I'm merely trying to fully understand the implications. At this stage, perhaps it would be helpful to prototype something for comparison. Regards, Dean
Re: Multivariate MCV stats can leak data to unprivileged users
On Sat, 18 May 2019 at 10:11, Dean Rasheed wrote: > > On Fri, 17 May 2019 at 21:29, Andres Freund wrote: > > > > On 2019-05-16 14:28:03 +0100, Dean Rasheed wrote: > > > 5). Some columns from pg_statistic_ext have to be made visible for > > > psql \d to work. Basically, it needs to be able to query for the > > > existence of extended statistics, but it doesn't need to see the > > > actual statistical data. Of course, we could change psql to use the > > > view, but this way gives us better backwards compatibility with older > > > clients. > > > > > > This is still going to break compatibility of any user code looking at > > > stxndistinct or stxdependencies from pg_statistic_ext, but at least it > > > doesn't break old versions of psql. > > > > Hm, it's not normally a goal to keep old psql working against new > > postgres versions. And there's plenty other issues preventing a v11 psql > > to work against 12. I'd not let this guide any design decisions. > > > > Ah good point. In fact running "\d some_table" from v11's psql against > a v12 database immediately falls over because of the removal of > relhasoids from pg_class, so this isn't a valid reason for retaining > access to any columns from pg_statistic_ext. > On the other hand, pg_dump relies on pg_statistic_ext to work out which extended statistics objects to dump. If we were to change that to use pg_stats_ext, then a user dumping a table with RLS using the --enable-row-security flag wouldn't get any extended statistics objects, which would be a somewhat surprising result. That could be fixed by changing the view to return rows for every extended statistics object, nulling out values in columns that the user doesn't have permission to see, in a similar way to Tomas' original patch. 
It would have to be modified to do the RLS check in the same place as the privilege checks, rather than in the top-level WHERE clause, and we'd probably also have to expose OIDs in addition to names, because that's what clients like psql and pg_dump want. To me, that feels quite messy though, so I think I'd still vote for leaving the first few columns of pg_statistic_ext accessible to public, and not have to change the clients to work differently from v12 onwards. Regards, Dean
Re: Multivariate MCV stats can leak data to unprivileged users
On Fri, 17 May 2019 at 21:29, Andres Freund wrote: > > On 2019-05-16 14:28:03 +0100, Dean Rasheed wrote: > > 5). Some columns from pg_statistic_ext have to be made visible for > > psql \d to work. Basically, it needs to be able to query for the > > existence of extended statistics, but it doesn't need to see the > > actual statistical data. Of course, we could change psql to use the > > view, but this way gives us better backwards compatibility with older > > clients. > > > > This is still going to break compatibility of any user code looking at > > stxndistinct or stxdependencies from pg_statistic_ext, but at least it > > doesn't break old versions of psql. > > Hm, it's not normally a goal to keep old psql working against new > postgres versions. And there's plenty other issues preventing a v11 psql > to work against 12. I'd not let this guide any design decisions. > Ah good point. In fact running "\d some_table" from v11's psql against a v12 database immediately falls over because of the removal of relhasoids from pg_class, so this isn't a valid reason for retaining access to any columns from pg_statistic_ext. Regards, Dean