Hi James,

On Fri, May 31, 2019 at 03:51:57PM -0400, James Coleman wrote:
> I've rebased the patch on master and confirmed make check-world passes.

Thanks for the rebase! I think the patch is in pretty good shape - I'm
sure we'll find ways to make it more efficient etc. but IMO that's fine
and should not prevent getting it committed.

The planning/costing logic may need further discussion and improvements,
though. IIRC this was the main reason why the patch missed PG11, because
at that point it simply used incremental sort whenever the input was
already presorted by a pathkey prefix, but that may be slower than regular
sort in some cases (unexpectedly large groups, etc.).

I see the current patch partially improves this by creating both paths
(full and incremental sort) and leaving the decision to the cost model.
That's good, because it means the decision is cost-based (as it should
be). The question however is how accurate the costing model is - and per
the discussion in this thread, it may need some improvements to handle
skewed distributions better.

Currently, the costing logic (cost_incremental_sort) assumes all groups
have the same size, which is fine for uniform distributions. But for
skewed distributions, that may be an issue.

Consider for example a table with 1M rows, two columns, 100 groups in
each column, and an index on the first one.

   CREATE TABLE t (a INT, b INT);

   INSERT INTO t SELECT 100*random(), 100*random()
     FROM generate_series(1,1000000);

   CREATE INDEX ON t (a);

   ANALYZE t;

Now, let's do a simple limit query to find the first row:

   SELECT * FROM t ORDER BY a, b LIMIT 1;

In this case the current costing logic is fine - the groups are close to
average size, and we only need to sort the first group, i.e. 1% of data.
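
Just to illustrate (this is merely a diagnostic query, not something the
planner would run), a quick check on the uniform data set confirms that:

   -- largest groups vs. the assumed ~10000-row average (~1% of the table)
   SELECT a, count(*) AS group_size
     FROM t
    GROUP BY a
    ORDER BY group_size DESC
    LIMIT 3;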

Now, let's say the first group is much larger:

   TRUNCATE t;

   INSERT INTO t SELECT 0, 100*random()
     FROM generate_series(1,900000) s(i);

   INSERT INTO t SELECT 100*random(), 100*random()
     FROM generate_series(1,100000);

   ANALYZE t;

That is, the first group now accounts for roughly 90% of the data, while
the number of groups stays the same. So to produce the first tuple we
have to sort 90% of the data, yet the average group size is unchanged,
which means the current cost model is oblivious to this.
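
For illustration, the actual size of that first group is easy to check
directly (again, just a diagnostic query):

   -- roughly 900000 of the 1000000 rows fall into the first group, while
   -- the cost model still assumes an average group of ~10000 rows
   SELECT count(*) AS first_group_size FROM t WHERE a = 0;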

But I think we can improve this by looking at the MCV lists (either
per-column or multi-column), seeing whether some groups are much larger
than others, and taking that into account when computing the costs.
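
For the per-column case that information is already collected by ANALYZE
and exposed in pg_stats, so (purely as an illustration) the MCV lists the
costing could consult look like this:

   -- after ANALYZE, the value 0 should appear in the MCV list of
   -- column "a" with a frequency close to 0.9
   SELECT attname, n_distinct, most_common_vals, most_common_freqs
     FROM pg_stats
    WHERE tablename = 't'
      AND attname IN ('a', 'b');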

In particular, I think we should estimate the size of the first group,
because that's important for startup cost - we need to process the whole
first group before producing the first tuple, and that matters for LIMIT
queries etc.

For example, let's say the first (already sorted) column has an MCV
list. Then we can see how large the first group (by value, not by
frequency) is, and use that instead of the average group size. E.g. in
the above example we'd know the first group is ~90% of the table.
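
As a rough sketch of that "first group by value" lookup (in SQL, just to
demonstrate the idea - the actual costing would presumably use the cached
statistics directly), one can unnest the MCV arrays and order by value:

   -- pair MCV values with their frequencies for column "a" and take the
   -- entry with the smallest value; for the skewed data set this should
   -- return the value 0 with a frequency around 0.9
   SELECT v AS first_mcv_value, f AS first_mcv_freq
     FROM (SELECT unnest(most_common_vals::text::int[]) AS v,
                  unnest(most_common_freqs) AS f
             FROM pg_stats
            WHERE tablename = 't' AND attname = 'a') s
    ORDER BY v
    LIMIT 1;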

And we could do the same for multiple columns, either by looking at a
multi-column MCV list (if there is one), or by taking the minimum of the
per-column MCV estimates.
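
To make the "minimum" part a bit more concrete, suppose (hypothetically)
the presorted prefix consisted of both columns (a, b) and there was no
multi-column MCV list. The first group on (a, b) is a subset of the
matching per-column groups, so its size is bounded by the smaller of the
two per-column estimates:

   -- crude upper bound: ~0.9 for a = 0, roughly 0.01 for a typical
   -- single value of b, so the first (a, b) group would be estimated
   -- at ~10000 rows at most
   SELECT least(0.9, 0.01) * 1000000 AS first_group_rows_upper_bound;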

Of course, these are only the groups that made it into the MCV list, and
there may be other (smaller) groups before the large one. For example,
there could be a group with the value "-1" and just a single row.

For a moment I thought we could/should look at the histogram, because
that could tell us if there are groups "before" the first MCV one, but I
don't think we should do that, for two reasons. Firstly, rare values may
not make it into the histogram anyway, which makes this rather unreliable
and might introduce sudden plan changes, because the cost would vary
wildly depending on whether we happened to sample the rare row or not.
And secondly, the rare row may easily be filtered out by a WHERE
condition or something, at which point we'd have to deal with the large
group anyway.

So I think we should look at the MCV list and use that information when
computing the startup/total cost. I think using the first/largest group
to compute the startup cost, and the average group size for the total
cost, would do the trick.

I don't think we can do much better than this during planning. There will
inevitably be cases where the costing model will push us to do the wrong
thing, in either direction. The question is how serious an issue that is,
and
whether we could remedy that during execution somehow.

When we "incorrectly" pick full sort (when the incremental sort would be
faster), that's obviously sad but I think it's mostly fine because it's
not a regression.

The opposite direction (picking incremental sort when a full sort would
be faster) is probably more serious, because it's a regression between
releases.

I don't think we can fully fix that by refining the cost model. We have
two basic options:

1) Argue/show that this is not an issue in practice, because (a) such
cases are very rare, and/or (b) the regression is limited. In short, the
benefits of the patch outweigh the losses.

2) Provide some fallback at execution time. For example, we might watch
the size of the group, and if we run into an unexpectedly large one we
might abandon the incremental sort and switch to a "full sort" mode.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


