On Fri, Oct 17, 2014 at 6:25 PM, Feng Tian <ft...@vitessedata.com> wrote: > I feel sorting string as if it is bytea is particularly interesting. I am > aware Peter G's patch and I think it is great, but for this sort agg case, > first, I believe it is still slower than sorting bytea, and second, Peter > G's patch depends on data. A common long prefix will make the patch less > effective, which is probably not so uncommon (for example, URL with a domain > prefix). I don't see any downside of sort bytea, other than lost the > interest ordering.
FWIW, that's probably less true than you'd think. Using Robert's test program: pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "http://www.something" "http://www.something" -> 131f1f1b2222221e1a18101f131419120109090909090909090909090909090909010909090909090909090909090909090901053d014201420444 (59 bytes) pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "http://www.another" "http://www.another" -> 131f1f1b2222220c191a1f13101d01090909090909090909090909090901090909090909090909090909090901053d014201420444 (53 bytes) So the first eight bytes of the first string is 0x131F1F1B2222221E, and the second 0x131F1F1B2222220C. The last byte is different. That's because the way the Unicode algorithm [1] works, there is often a significantly greater concentration of entropy in the first 8 bytes as compared to raw C strings compared with memcmp() - punctuation characters and so on are not actually described at the primary weight level. If we can get even a single byte to somewhat differentiate each string, we can still win by a very significant amount - just not an enormous amount. The break even point is hard to characterize exactly, but I'm quite optimistic that a large majority of real-world text sorts will see at least some benefit, while a smaller majority will be much, much faster. [1] http://www.unicode.org/reports/tr10/#Notation -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers