Thanks for looking into this.

On Thu, Jun 12, 2014 at 9:25 AM, Robert Haas <robertmh...@gmail.com> wrote:
> Still, it's fair to say that on this Linux system, the first 8 bytes
> capture a significant portion of the entropy of the first 8 bytes of
> the string, whereas on MacOS X you only get entropy from the first 2
> bytes of the string. It would be interesting to see results from
> other platforms people might care about also.
Right. It was a little incautious of me to say that we get the full benefit of 8 bytes of storage with "en_US.UTF-8", since that is only true of lowercase characters. (I think FreeBSD can play tricks here: sometimes it will give you the benefit of 8 bytes of entropy for an 8-byte string, with only non-differentiating trailing bytes, so that the first 8 bytes of "Aaaaaaaa" are distinct from the first 8 bytes of "aaaaaaaa", while the trailing bytes are non-distinct for both.)

In any case, it's pretty clear that a goal of the glibc implementation is to concentrate as much entropy as possible into the first part of the string, and that's the important point. That makes perfect sense, and is why I was so incredulous about the Mac behavior. After all, the Open Group's strcoll() documentation says: "The strxfrm() and strcmp() functions should be used for sorting large lists." Sorting text is hardly an infrequent requirement -- it's almost the entire reason for having strxfrm() in the standard. You're always going to want each strcmp() to find differences as early as possible.

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers