Re: [HACKERS] PATCH: adaptive ndistinct estimator v4

Tomas Vondra Thu, 30 Apr 2015 18:21:07 -0700

Hi,

On 04/30/15 22:57, Robert Haas wrote:

On Tue, Mar 31, 2015 at 3:02 PM, Tomas Vondra
<[email protected]> wrote:

attached is v4 of the patch implementing adaptive ndistinct estimator.


So, I took a look at this today. It's interesting work, but it looks
more like a research project than something we can commit to 9.5. As
far as I can see, this still computes the estimate the way we do
today, but then also computes the estimate using this new method.
The estimate computed the new way isn't stored anywhere, so this
doesn't really change behavior, except for printing out a WARNING
comparing the values produced by the two estimators.


I agree that this is not ready for 9.5 - it was meant as an experiment
(hence printing the estimate in a WARNING, to make it easier to compare
the value to the current estimator). Without that it'd be much more
complicated to compare the old/new estimates, but you're right this is
not suitable for commit.

So far it received only reviews from Jeff Janes (thanks!), but I think achange like this really requires a more thorough review, including themath part. I don't expect that to happen at the very end of the last CFbefore the freeze.

IMHO, the comments in this patch are pretty inscrutable.  I believe
this is because they presume more knowledge of what the patch is doing
than I myself possess.  For example:

+ * The AEE estimator is based on solving this equality (for "m")
+ *
+ *     m - f1 - f2 = f1 * (A + A(m)) / (B + B(m))
+ *
+ * where A, B are effectively constants (not depending on m), and A(m)
+ * and B(m) are functions. This is equal to solving
+ *
+ *     0 = f1 * (A + A(m)) / (B + B(m)) - (m - f1 - f2)

Perhaps I am just a dummy, but I have no idea what any of that means.
I think that needs to be fixed so that someone who knows what
n_distinct is but knows nothing about the details of these estimators
can get an idea of how they are doing their thing without too much
effort.  I think a lot of the comments share this problem.

Well, I don't think you're dummy, but this requires reading the paperdescribing the estimator. Explaining that fully would essentially meancopying a large portion of the paper in the comment, and I supposethat's not a good idea. The explanation might be perhaps a bit moredetailed, though - not sure what's the right balance.

Aside from the problems mentioned above, there's the broader question
of how to evaluate the quality of the estimates produced by this
estimator vs. what we're doing right now. I see that Jeff Janes has
pointed out some cases that seem to regress with this patch; those
presumably need some investigation, or at least some comment. And I
think some testing from other people would be good too, before we
commit to this.

Yeah, evaluating is difficult. I think that Jeff's approach - i.e.testing the estimator on real-world data sets - is the right approachhere. Testing on synthetic data sets has it's value too (if only tobetter understand the failure cases).

Leaving that aside, at some point, you'll say, "OK, there may be some
regressions left but overall I believe this is going to be a win in
most cases". It's going to be really hard for anyone, no matter how
smart, to figure out through code review whether that is true. So
committing this figures to be extremely frightening. It's just not
going to be reasonably possible to know what percentage of users are
going to be more happy after this change and what percentage are
going to be less happy.

For every pair of estimators you can find cases where one of them isbetter than the other one. It's pretty much impossible to find anestimator that beats all other estimators on all possible inputs.

There's no way to make this an improvement for everyone - it willproduce worse estimates in some cases, and we have to admit that. If wethink this is unacceptable, we're effectively stuck with the currentestimator forever.

Therefore, I think that:

1. This should be committed near the beginning of a release cycle,
not near the end. That way, if there are problem cases, we'll have a
year or so of developer test to shake them out.

+1

2. There should be a compatibility GUC to restore the old behavior.
The new behavior should be the default, because if we're not
confident that the new behavior will be better for most people, we
have no business installing it in the first place (plus few people
will try it). But just in case it turns out to suck for some people,
we should provide an escape hatch, at least for a few releases.


I think a "compatibility GUC" is a damn poor solution, IMNSHO.

For example, GUCs are database-wide, but I do expect the estimator toperform worse only on a few data sets / columns. So making this acolumn-level settings would be more appropriate, I think.

But it might work during the development cycle, as it would makecomparing the estimators possible (just run the tests with the GUC setdifferently). Assuming we'll re-evaluate it at the end, and remove theGUC if possible.


3. There should be some clear documentation in the comments indicating
why we believe that this is a whole lot better than what we do today.
Maybe this has been discussed adequately on the thread and maybe it
hasn't, but whoever goes to look at the committed code should not have
to go root through hackers threads to understand why we replaced the
existing estimator.  It should be right there in the code.  If,
hypothetically speaking, I were to commit this, and if, again strictly
hypothetically, another distinguished committer were to write back a
year or two later, "clearly Robert was an idiot to commit this because
it's no better than what we had before" then I want to be able to say
"clearly, you have not what got committed very carefully, because the
comment for function <blat> clearly explains that this new technology
is teh awesome".


I certainly can add such comment to the patch ;-) Choose a function.


--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PATCH: adaptive ndistinct estimator v4

Reply via email to