Hi,

On 04/30/15 22:57, Robert Haas wrote:
On Tue, Mar 31, 2015 at 3:02 PM, Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:
attached is v4 of the patch implementing adaptive ndistinct estimator.

So, I took a look at this today. It's interesting work, but it looks
more like a research project than something we can commit to 9.5. As
far as I can see, this still computes the estimate the way we do
today, but then also computes the estimate using this new method.
The estimate computed the new way isn't stored anywhere, so this
doesn't really change behavior, except for printing out a WARNING
comparing the values produced by the two estimators.

I agree that this is not ready for 9.5 - it was meant as an experiment
(hence printing the estimate in a WARNING, to make it easier to compare
the value to the current estimator). Without that it'd be much more
complicated to compare the old/new estimates, but you're right this is
not suitable for commit.

So far it received only reviews from Jeff Janes (thanks!), but I think a change like this really requires a more thorough review, including the math part. I don't expect that to happen at the very end of the last CF before the freeze.

IMHO, the comments in this patch are pretty inscrutable.  I believe
this is because they presume more knowledge of what the patch is doing
than I myself possess.  For example:

+ * The AEE estimator is based on solving this equality (for "m")
+ *
+ *     m - f1 - f2 = f1 * (A + A(m)) / (B + B(m))
+ *
+ * where A, B are effectively constants (not depending on m), and A(m)
+ * and B(m) are functions. This is equal to solving
+ *
+ *     0 = f1 * (A + A(m)) / (B + B(m)) - (m - f1 - f2)

Perhaps I am just a dummy, but I have no idea what any of that means.
I think that needs to be fixed so that someone who knows what
n_distinct is but knows nothing about the details of these estimators
can get an idea of how they are doing their thing without too much
effort.  I think a lot of the comments share this problem.

Well, I don't think you're dummy, but this requires reading the paper describing the estimator. Explaining that fully would essentially mean copying a large portion of the paper in the comment, and I suppose that's not a good idea. The explanation might be perhaps a bit more detailed, though - not sure what's the right balance.

Aside from the problems mentioned above, there's the broader question
of how to evaluate the quality of the estimates produced by this
estimator vs. what we're doing right now. I see that Jeff Janes has
pointed out some cases that seem to regress with this patch; those
presumably need some investigation, or at least some comment. And I
think some testing from other people would be good too, before we
commit to this.

Yeah, evaluating is difficult. I think that Jeff's approach - i.e. testing the estimator on real-world data sets - is the right approach here. Testing on synthetic data sets has it's value too (if only to better understand the failure cases).

Leaving that aside, at some point, you'll say, "OK, there may be some
regressions left but overall I believe this is going to be a win in
most cases". It's going to be really hard for anyone, no matter how
smart, to figure out through code review whether that is true. So
committing this figures to be extremely frightening. It's just not
going to be reasonably possible to know what percentage of users are
going to be more happy after this change and what percentage are
going to be less happy.

For every pair of estimators you can find cases where one of them is better than the other one. It's pretty much impossible to find an estimator that beats all other estimators on all possible inputs.

There's no way to make this an improvement for everyone - it will produce worse estimates in some cases, and we have to admit that. If we think this is unacceptable, we're effectively stuck with the current estimator forever.

Therefore, I think that:

1. This should be committed near the beginning of a release cycle,
not near the end. That way, if there are problem cases, we'll have a
year or so of developer test to shake them out.

+1

2. There should be a compatibility GUC to restore the old behavior.
The new behavior should be the default, because if we're not
confident that the new behavior will be better for most people, we
have no business installing it in the first place (plus few people
will try it). But just in case it turns out to suck for some people,
we should provide an escape hatch, at least for a few releases.

I think a "compatibility GUC" is a damn poor solution, IMNSHO.

For example, GUCs are database-wide, but I do expect the estimator to perform worse only on a few data sets / columns. So making this a column-level settings would be more appropriate, I think.

But it might work during the development cycle, as it would make comparing the estimators possible (just run the tests with the GUC set differently). Assuming we'll re-evaluate it at the end, and remove the GUC if possible.


3. There should be some clear documentation in the comments indicating
why we believe that this is a whole lot better than what we do today.
Maybe this has been discussed adequately on the thread and maybe it
hasn't, but whoever goes to look at the committed code should not have
to go root through hackers threads to understand why we replaced the
existing estimator.  It should be right there in the code.  If,
hypothetically speaking, I were to commit this, and if, again strictly
hypothetically, another distinguished committer were to write back a
year or two later, "clearly Robert was an idiot to commit this because
it's no better than what we had before" then I want to be able to say
"clearly, you have not what got committed very carefully, because the
comment for function <blat> clearly explains that this new technology
is teh awesome".

I certainly can add such comment to the patch ;-) Choose a function.


--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to