Hi James,

On Fri, May 31, 2019 at 03:51:57PM -0400, James Coleman wrote:
> I've rebased the patch on master and confirmed make check-world passes.
Thanks for the rebase! I think the patch is in pretty good shape - I'm sure we'll find ways to make it more efficient etc., but IMO that's fine and should not prevent getting it committed.

The planning/costing logic may need further discussion and improvements, though. IIRC this was the main reason why the patch missed PG11 - at that point it simply used incremental sort whenever the input was already presorted by a pathkey prefix, but that may be slower than a regular sort in some cases (unexpectedly large groups, etc.).

I see the current patch partially improves this by creating both paths (full and incremental sort). That's good, because it means the decision is cost-based (as it should be). The question however is how accurate the costing model is - and per discussion in the thread, it may need some improvements to handle skewed distributions better.

Currently, the costing logic (cost_incremental_sort) assumes all groups have the same size, which is fine for uniform distributions. But for skewed distributions that may be an issue.

Consider for example a table with 1M rows, two columns, 100 groups in each column, and an index on the first one:

  CREATE TABLE t (a INT, b INT);
  CREATE INDEX ON t (a);

  INSERT INTO t SELECT 100*random(), 100*random() FROM generate_series(1,1000000);

Now, let's do a simple limit query to find the first row:

  SELECT * FROM t ORDER BY a, b LIMIT 1;

In this case the current costing logic is fine - the groups are close to the average size, and we only need to sort the first group, i.e. 1% of the data.

Now, let's say the first group is much larger:

  TRUNCATE t;

  INSERT INTO t SELECT 0, 100*random() FROM generate_series(1,900000) s(i);
  INSERT INTO t SELECT 100*random(), 100*random() FROM generate_series(1,100000);

That is, the first group is now roughly 90% of the data, while the number of groups is still the same - so we have to sort 90% of the data before producing the first tuple. But the average group size is unchanged, so the current cost model is oblivious to this.

I think we can improve this by looking at the MCV lists (either per-column or multi-column), checking whether some groups are much larger, and considering that when computing the costs. In particular, I think we should estimate the size of the first group, because that's important for the startup cost - we need to process the whole first group before producing the first tuple, and that matters for LIMIT queries etc.

For example, let's say the first (already sorted) column has an MCV list. Then we can see how large the first group (by value, not frequency) is, and use that instead of the average group size. E.g. in the above example we'd know the first group is ~90%.
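(Just to illustrate where the planner could get this from - after an ANALYZE on the skewed table above, the per-column MCV list for "a" should already show the large group, with a frequency close to 0.9. Something like this should be enough to see it:

  ANALYZE t;

  SELECT most_common_vals, most_common_freqs
    FROM pg_stats
   WHERE tablename = 't' AND attname = 'a';

So the costing could use the frequency of the relevant MCV item instead of assuming the average group size, at least for the startup cost.)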
And we could do the same for multiple columns, either by looking at a multi-column MCV list (if there is one), or by using the minimum from the per-column MCV lists.

Of course, these are only the groups that made it to the MCV list, and there may be other (smaller) groups before the large one. For example, there could be a group with value "-1" and a single row.

For a moment I thought we could/should look at the histogram, because that could tell us if there are groups "before" the first MCV one, but I don't think we should do that, for two reasons. Firstly, rare values may not get into the histogram anyway, which makes this rather unreliable and might introduce sudden plan changes, because the cost would vary wildly depending on whether we happened to sample the rare row or not. And secondly, the rare row may easily be filtered out by a WHERE condition or something, at which point we'll have to deal with the large group anyway.

So I think we should look at the MCV list, and use that information when computing the startup/total cost. I think using the first/largest group to compute the startup cost, and the average group size for the total cost, would do the trick.

I don't think we can do much better than this during planning. There will inevitably be cases where the costing model will push us to do the wrong thing, in either direction. The question is how serious an issue that is, and whether we could remedy it during execution somehow.

When we "incorrectly" pick full sort (when the incremental sort would be faster), that's obviously sad, but I think it's mostly fine because it's not a regression. The opposite direction (picking incremental sort when full sort would be faster) is probably more serious, because it's a regression between releases.

I don't think we can fully fix that by refining the cost model. We have two basic options:

1) Argue/show that this is not an issue in practice, because (a) such cases are very rare, and/or (b) the regression is limited. In short, the benefits of the patch outweigh the losses.

2) Provide some fallback at execution time. For example, we might watch the size of the group, and if we run into an unexpectedly large one we might abandon the incremental sort and switch to a "full sort" mode.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services