>>>>> "Peter" == Peter Geoghegan <p...@heroku.com> writes:
Peter> I don't think it's that simple. The actual point of abbreviation
Peter> is to amortize the total cost of performing O(n log n)
Peter> comparisons (the average case, which is usually assumed by
Peter> infrastructure like sortsupport), by performing an encoding
Peter> process n times. That encoding process may be more expensive
Peter> than each of the comparisons. Sometimes significantly more
Peter> expensive.

But the cost of the encoding depends only on the number of non-null
values. Again, the nulls turn out to be irrelevant.

Peter> Forget about abbreviation entirely for a minute - consider plain
Peter> old sorting of strxfrm() blobs, which for example the glibc
Peter> documentation recommends (with some caveats) as the preferred
Peter> approach to sorting strings while respecting collation. Suppose
Peter> that your application happens to frequently encounter fully
Peter> sorted arrays of strings when passed an array of strings to
Peter> sort. If you were using our qsort() implementation, which,
Peter> rather questionably, has a pre-check for fully sorted input
Peter> (that throws away work when the pre-check fails), then you'll
Peter> only have n comparisons. Even if an encoding is only as
Peter> expensive as an actual comparison, the potential to lose with
Peter> encoding clearly exists. Similarly, we don't abbreviate for
Peter> top-n heap sorts, even though all of those abbreviated keys may
Peter> be used for at least one comparison.

Nothing in the above paragraph has the slightest relevance to even the
text abbreviation code, much less the numeric one.

Peter> Now, I hear what you're saying about weighing the costs and the
Peter> benefits -- you point out that abbreviation of NULL values is
Peter> not a cost, which is certainly true.

It's not a cost because it DOESN'T HAPPEN. The abbreviation function is
not called for null inputs.
Peter> But if the total number of NULL values is so high that they
Peter> dominate, then abbreviation might well be the wrong thing for
Peter> that reason alone.

No. This is the point that you're getting wrong.

The amount of time saved by the abbreviation code depends ONLY on the
number and cardinality of the NON-NULL values. Also, the time _lost_ by
the abbreviation code if we fail to abort in pathological cases STILL
depends only on the number of non-null values, so there's no benefit in
aborting even in a case when we have 1000 equal non-null values mixed
in with tens of millions of nulls.

Let me give you an actual example:

create table numtest2 as
  select floor(random() * 200)::numeric as a
    from generate_series(1,1000000);

create table numtest3 as
  select a
    from (select floor(random() * 200)::numeric as a
            from generate_series(1,1000000)
          union all
          select null from generate_series(1,4000000)) s
   order by random();

So one table has a million non-null rows with 200 distinct values, and
the other has the same plus 4 million null rows.

Using my code, which ignores nulls in the cost model:

select * from numtest2 order by a offset 100000000;  -- 1.5 seconds
select * from numtest3 order by a offset 100000000;  -- 3.6 seconds

With abbreviation disabled entirely:

select * from numtest2 order by a offset 100000000;  -- 4.5 seconds
select * from numtest3 order by a offset 100000000;  -- 6.8 seconds

Notice that while proportionally the speedup is only <2x rather than 3x
for the version with nulls, the absolute difference in time is the same
in both cases - abbreviation saved us ~3 seconds in both cases. (This
relation remains true for larger values too: make it 19 million nulls
rather than 4 million and the timings are 11.5 sec / 14.5 sec, still a
3 sec difference.)

Your version would have aborted abbreviation on that second query, thus
losing a 3 second speedup. How on earth is that supposed to be
justified?
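To make the shape of the argument concrete, here is a minimal sketch of
an abort test that, like my version, consults only the non-null values.
The function name, the 10000-row threshold, and the 5% proportion are
all made up for illustration - this is not the actual sortsupport code:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical abort decision: judge abbreviation effectiveness by the
 * cardinality of the abbreviated keys among NON-NULL values only.
 * Nulls never reach the abbreviation function, so they contribute
 * neither cost nor benefit and are left out of the model entirely. */
static bool
should_abort_abbrev(size_t nonnull_count, size_t distinct_abbrevs)
{
    /* Too few non-null values seen to judge yet: keep abbreviating. */
    if (nonnull_count < 10000)
        return false;

    /* Abort only when the abbreviated keys are mostly duplicates of
     * one another, i.e. most comparisons would fall through to the
     * full comparator anyway.  A pile of additional nulls changes
     * neither input to this test. */
    return distinct_abbrevs < nonnull_count / 20;
}
```

Feeding the numtest3 case into a model like this gives the same answer
as for numtest2, because the 4 million nulls appear in neither argument.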
It's not like there's any realistically possible case where your
version performs faster than mine by more than a tiny margin.

Peter> Where I think your argument is stronger is around the cost of
Peter> actually aborting in particular (even though not much work is
Peter> thrown away). Scanning through the memtuples array once more
Peter> certainly isn't free.

Yes, the cost of actually aborting abbreviation goes up with the number
of nulls. But your version of the code makes that WORSE, by making it
more likely that we will abort unnecessarily. If we use the raw value
of memtuplecount for anything, it should be to make it LESS likely that
we abort abbreviations (since we'd be paying a higher cost), not more.
But benchmarking doesn't suggest that this is necessary, at least not
for numerics.

Peter> And you could take the view that it's always worth the risk,
Peter> since it's at most a marginal cost saved if the first ~10k
Peter> tuples are representative, but a large cost incurred when they
Peter> are not. So on reflection, I am inclined to put the check for
Peter> non-null values back in.

Right result but the wrong reasoning.

Peter> I also concede that the cost of abbreviation is lower with your
Peter> numeric encoding scheme than it is for that of the text opclass,
Peter> which favors this approach.

If you do some benchmarks on text columns, you'll find that this is a
win there too.

-- 
Andrew (irc:RhodiumToad)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers