>>>>> "Peter" == Peter Geoghegan <p...@heroku.com> writes:
Peter> I don't think it's that simple. The actual point of abbreviation
Peter> is to amortize the total cost of performing O(n log n)
Peter> comparisons (the average case, which is usually assumed by
Peter> infrastructure like sortsupport), by performing an encoding
Peter> process n times. That encoding process may be more expensive
Peter> than each of the comparisons. Sometimes significantly more
Peter> expensive.

But the cost of the encoding depends only on the number of non-null
values. Again, the nulls turn out to be irrelevant.

Peter> Forget about abbreviation entirely for a minute - consider plain
Peter> old sorting of strxfrm() blobs, which for example the glibc
Peter> documentation recommends (with some caveats) as the preferred
Peter> approach to sorting strings while respecting collation. Suppose
Peter> that your application happens to frequently encounter fully
Peter> sorted arrays of strings when passed an array of strings to
Peter> sort. If you were using our qsort() implementation, which,
Peter> rather questionably, has a pre-check for fully sorted input
Peter> (that throws away work when the pre-check fails), then you'll
Peter> only have n comparisons. Even if an encoding is only as
Peter> expensive as an actual comparison, the potential to lose with
Peter> encoding clearly exists. Similarly, we don't abbreviate for
Peter> top-n heap sorts, even though all of those abbreviated keys may
Peter> be used for at least one comparison.

Nothing in the above paragraph has the slightest relevance to even the
text abbreviation code, much less the numeric one.

Peter> Now, I hear what you're saying about weighing the costs and the
Peter> benefits -- you point out that abbreviation of NULL values is
Peter> not a cost, which is certainly true.

It's not a cost because it DOESN'T HAPPEN. The abbreviation function is
not called for null inputs.
Peter> But if the total number of NULL values is so high that they
Peter> dominate, then abbreviation might well be the wrong thing for
Peter> that reason alone.

No. This is the point that you're getting wrong.

The amount of time saved by the abbreviation code depends ONLY on the
number and cardinality of the NON-NULL values. Also, the time _lost_ by
the abbreviation code if we fail to abort in pathological cases STILL
depends only on the number of non-null values, so there's no benefit in
aborting even in a case when we have 1000 equal non-null values mixed
in with tens of millions of nulls.

Let me give you an actual example:

create table numtest2 as
  select floor(random() * 200)::numeric as a
    from generate_series(1,1000000);

create table numtest3 as
  select a
    from (select floor(random() * 200)::numeric as a
            from generate_series(1,1000000)
          union all
          select null from generate_series(1,4000000)) s
   order by random();

So one table has a million non-null rows with 200 distinct values, and
the other has the same plus 4 million null rows.

Using my code, which ignores nulls in the cost model:

select * from numtest2 order by a offset 100000000;  -- 1.5 seconds
select * from numtest3 order by a offset 100000000;  -- 3.6 seconds

With abbreviation disabled entirely:

select * from numtest2 order by a offset 100000000;  -- 4.5 seconds
select * from numtest3 order by a offset 100000000;  -- 6.8 seconds

Notice that while proportionally the speedup is only <2x rather than 3x
for the version with nulls, the absolute difference in time is the same
in both cases - abbreviation saved us ~3 seconds in both cases. (This
relation remains true for larger values too: make it 19 million nulls
rather than 4 million and the timings are 11.5 sec / 14.5 sec, still a
3 sec difference.)

Your version would have aborted abbreviation on that second query, thus
losing a 3 second speedup. How on earth is that supposed to be
justified?
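To make the shape of the argument concrete, here is a minimal sketch of
an abort test that, like my version, consults only the non-null values.
The function name, the 10000-row threshold, and the 5% proportion are
all made up for illustration - this is not the actual sortsupport code:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical abort decision: judge abbreviation effectiveness by the
 * cardinality of the abbreviated keys among NON-NULL values only.
 * Nulls never reach the abbreviation function, so they contribute
 * neither cost nor benefit and are left out of the model entirely. */
static bool
should_abort_abbrev(size_t nonnull_count, size_t distinct_abbrevs)
{
    /* Too few non-null values seen to judge yet: keep abbreviating. */
    if (nonnull_count < 10000)
        return false;

    /* Abort only when the abbreviated keys are mostly duplicates of
     * one another, i.e. most comparisons would fall through to the
     * full comparator anyway.  A pile of additional nulls changes
     * neither input to this test. */
    return distinct_abbrevs < nonnull_count / 20;
}
```

Feeding the numtest3 case into a model like this gives the same answer
as for numtest2, because the 4 million nulls appear in neither argument.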
It's not like there's any realistically possible case where your
version performs faster than mine by more than a tiny margin.

Peter> Where I think your argument is stronger is around the cost of
Peter> actually aborting in particular (even though not much work is
Peter> thrown away). Scanning through the memtuples array once more
Peter> certainly isn't free.

Yes, the cost of actually aborting abbreviation goes up with the number
of nulls. But your version of the code makes that WORSE, by making it
more likely that we will abort unnecessarily. If we use the raw value
of memtuplecount for anything, it should be to make it LESS likely that
we abort abbreviations (since we'd be paying a higher cost), not more.
But benchmarking doesn't suggest that this is necessary, at least not
for numerics.

Peter> And you could take the view that it's always worth the risk,
Peter> since it's at most a marginal cost saved if the first ~10k
Peter> tuples are representative, but a large cost incurred when they
Peter> are not. So on reflection, I am inclined to put the check for
Peter> non-null values back in.

Right result but the wrong reasoning.

Peter> I also concede that the cost of abbreviation is lower with your
Peter> numeric encoding scheme than it is for that of the text opclass,
Peter> which favors this approach.

If you do some benchmarks on text columns, you'll find that this is a
win there too.

-- 
Andrew (irc:RhodiumToad)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers