On Tue, Sep 2, 2014 at 4:41 PM, Peter Geoghegan <p...@heroku.com> wrote:
> HyperLogLog isn't sample-based - it's useful for streaming a set and
> accurately tracking its cardinality with fixed overhead.

OK.

>> Is it the right decision to suppress the abbreviated-key optimization
>> unconditionally on 32-bit systems and on Darwin?  There's certainly
>> more danger, on those platforms, that the optimization could fail to
>> pay off.  But it could also win big, if in fact the first character or
>> two of the string is enough to distinguish most rows, or if Darwin
>> improves their implementation in the future.  If the other defenses
>> against pathological cases in the patch are adequate, I would think
>> it'd be OK to remove the hard-coded checks here and let those cases
>> use the optimization or not according to its merits in particular
>> cases.  We'd want to look at what the impact of that is, of course,
>> but if it's bad, maybe those other defenses aren't adequate anyway.
>
> I'm not sure. Perhaps the Darwin thing is a bad idea because no one is
> using Macs to run real database servers. Apple haven't had a server
> product in years, and typically people only use Postgres on their Macs
> for development. We might as well have coverage of the new code for
> the benefit of Postgres hackers that favor Apple machines. Or, to look
> at it another way, the optimization is so beneficial that it's
> probably worth the risk, even for more marginal cases.
>
> 8 primary weights (the leading 8 bytes, frequently isomorphic to the
> first 8 Latin characters, regardless of whether or not they have
> accents/diacritics, or punctuation/whitespace) is twice as many as 4.
> But every time you add a byte of space to the abbreviated
> representation that can resolve a comparison, the number of
> unresolvable-without-tiebreak comparisons (in general) is, I imagine,
> reduced considerably. Basically, 8 bytes is way better than twice as
> good as 4 bytes in terms of its effect on the proportion of
> comparisons that are resolved only with abbreviated keys. Even so,
> I suspect it's still worth applying the optimization with only 4.
>
> You've seen plenty of suggestions on assessing the applicability of
> the optimization from me. Perhaps you have a few of your own.

My suggestion is to remove the special cases for Darwin and 32-bit
systems and see how it goes.
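
To make the 4-versus-8-byte point above concrete, here is a rough
sketch (not code from the patch) that counts how many pairs of keys
cannot be told apart by a fixed-width prefix alone. The names
make_prefix() and count_unresolved_pairs() are hypothetical, and raw
string bytes stand in for strxfrm() output, which is an assumption
made only to keep the example deterministic.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_ABBREV 8            /* widest prefix this sketch supports */

/* Copy up to `width` bytes of s into out, zero-padding the remainder. */
void
make_prefix(const char *s, size_t width, unsigned char out[MAX_ABBREV])
{
    size_t      len;

    assert(s != NULL);
    len = strlen(s);
    if (width > MAX_ABBREV)
        width = MAX_ABBREV;
    memset(out, 0, MAX_ABBREV);
    memcpy(out, s, len < width ? len : width);
}

/*
 * Count the pairs whose prefixes compare equal, i.e. the comparisons
 * that would still need a full tiebreak comparison.
 */
size_t
count_unresolved_pairs(const char *keys[], size_t n, size_t width)
{
    size_t      unresolved = 0;
    size_t      i;
    size_t      j;

    for (i = 0; i < n; i++)
    {
        for (j = i + 1; j < n; j++)
        {
            unsigned char a[MAX_ABBREV];
            unsigned char b[MAX_ABBREV];

            make_prefix(keys[i], width, a);
            make_prefix(keys[j], width, b);
            if (memcmp(a, b, MAX_ABBREV) == 0)
                unresolved++;
        }
    }
    return unresolved;
}
```

With keys that share a common 4-byte lead-in, such as "user0001" and
"user0002", a width of 4 leaves those pairs unresolved, while a width
of 8 separates them. Running something like this against real data on
a 32-bit build would be one way to see how it goes.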

> That wouldn't be harmless - it would probably result in incorrect
> answers in practice, and would certainly be unspecified. However, I'm
> not reading uninitialized bytes. I call memset() so that, in the
> event of the final strxfrm() blob being less than 8 bytes (which can
> happen even on glibc with en_US.UTF-8), the remaining bytes are
> zeroed. It cannot be harmful to memcmp() every Datum byte if the
> remaining bytes are always initialized to NUL.

OK.
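
For readers following along, the zero-padding argument can be sketched
roughly as follows. This is an illustration under stated assumptions,
not the patch's code: abbrev_pack() and abbrev_cmp() are hypothetical
names, ABBREV_LEN stands in for the Datum size on a 64-bit build, and
the caller is assumed to pass in an already-transformed blob (for
example, strxfrm() output).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ABBREV_LEN 8            /* stand-in for sizeof(Datum) on 64-bit */

/*
 * Pack the leading bytes of a transformed blob into a fixed-size key.
 * The memset() guarantees that every byte of the key is initialized,
 * even when the blob is shorter than ABBREV_LEN.
 */
void
abbrev_pack(const char *blob, size_t len, unsigned char out[ABBREV_LEN])
{
    assert(blob != NULL || len == 0);
    memset(out, 0, ABBREV_LEN);
    if (len > 0)
        memcpy(out, blob, len < ABBREV_LEN ? len : ABBREV_LEN);
}

/*
 * Compare two packed keys byte by byte.  Because short keys are padded
 * with NUL bytes, comparing all ABBREV_LEN bytes never reads
 * uninitialized memory.
 */
int
abbrev_cmp(const unsigned char a[ABBREV_LEN],
           const unsigned char b[ABBREV_LEN])
{
    return memcmp(a, b, ABBREV_LEN);
}
```

A result of zero from abbrev_cmp() only means the abbreviated keys are
equal; the caller would still have to fall back to a full comparison to
break the tie.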

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
