Thanks for looking into this.

On Thu, Jun 12, 2014 at 9:25 AM, Robert Haas <robertmh...@gmail.com> wrote:
> Still, it's fair to say that on this Linux system, the first 8 bytes
> capture a significant portion of the entropy of the first 8 bytes of
> the string, whereas on MacOS X you only get entropy from the first 2
> bytes of the string. It would be interesting to see results from
> other platforms people might care about also.
Right. It was a little incautious of me to say that we get the full benefit of 8 bytes of storage with "en_US.UTF-8", since that is only true of lowercase characters. (I think FreeBSD can play tricks here: sometimes it will give you the benefit of 8 bytes of entropy for an 8-byte string, with only non-differentiating trailing bytes, so that the first 8 bytes of "Aaaaaaaa" are distinct from the first 8 bytes of "aaaaaaaa", while the trailing bytes are non-distinct for both.)

In any case, it's pretty clear that a goal of the glibc implementation is to concentrate as much entropy as possible into the first part of the string, and that's the important point. That makes perfect sense, and is why I was so incredulous about the Mac behavior. After all, the Open Group's strcoll() documentation says: "The strxfrm() and strcmp() functions should be used for sorting large lists." Sorting text is hardly an infrequent requirement -- it's almost the entire reason for having strxfrm() in the standard. You're always going to want each strcmp() to find differences as early as possible.

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers