On Fri, Oct 17, 2014 at 6:25 PM, Feng Tian <ft...@vitessedata.com> wrote:
> I feel sorting string as if it is bytea is particularly interesting.  I am
> aware Peter G's patch and I think it is great, but for this sort agg case,
> first, I believe it is still slower than sorting bytea, and second, Peter
> G's patch depends on data.   A common long prefix will make the patch less
> effective, which is probably not so uncommon (for example, URL with a domain
> prefix).  I don't see any downside of sort bytea, other than lost the
> interest ordering.

FWIW, that's probably less true than you'd think. Using Robert's test program:

pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "http://www.something"
"http://www.something" ->
131f1f1b2222221e1a18101f131419120109090909090909090909090909090909010909090909090909090909090909090901053d014201420444
(59 bytes)
pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "http://www.another"
"http://www.another" ->
131f1f1b2222220c191a1f13101d01090909090909090909090909090901090909090909090909090909090901053d014201420444
(53 bytes)

So the first eight bytes of the first string are 0x131F1F1B2222221E,
and those of the second are 0x131F1F1B2222220C. Only the last byte
differs, even though the raw strings share the 11-byte prefix
"http://www.". That's because of the way the Unicode collation
algorithm [1] works: there is often a significantly greater
concentration of entropy in the first 8 bytes of the strxfrm() blob
than in the first 8 bytes of the raw C string compared with memcmp() -
punctuation characters and so on are not actually represented at the
primary weight level. If we can get even a single byte to somewhat
differentiate each string, we can still win by a very significant
amount - just not an enormous amount. The break-even point is hard to
characterize exactly, but I'm quite optimistic that a large majority
of real-world text sorts will see at least some benefit, while a
smaller majority will be much, much faster.
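
To make the comparison concrete, here is a rough Python sketch (not
code from the patch) that contrasts the raw 8-byte prefixes with the
8-byte strxfrm() prefixes quoted above. The names KEY_SIZE and
common_prefix_len are made up for illustration, and the 8-byte key
width is an assumption taken from the "first eight bytes" discussion.

```python
# Illustrative only: compares leading bytes the way memcmp() would.

KEY_SIZE = 8  # assumed key width, matching the "first eight bytes" above


def common_prefix_len(a: bytes, b: bytes) -> int:
    """Return how many leading bytes a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Raw C strings, as passed to the test program.
raw_a = b"http://www.something"
raw_b = b"http://www.another"

# First eight bytes of each strxfrm() blob, copied from the output above.
xfrm_a = bytes.fromhex("131f1f1b2222221e")
xfrm_b = bytes.fromhex("131f1f1b2222220c")

# Raw prefixes are identical, so an 8-byte raw key cannot separate them.
raw_tie = raw_a[:KEY_SIZE] == raw_b[:KEY_SIZE]

# The transformed prefixes already differ within the key.
xfrm_tie = xfrm_a == xfrm_b

print("raw 8-byte prefixes tie:", raw_tie)
print("strxfrm 8-byte prefixes tie:", xfrm_tie)
print("shared raw bytes:", common_prefix_len(raw_a, raw_b))
print("shared strxfrm bytes:", common_prefix_len(xfrm_a, xfrm_b))
```

The point is only that the eighth transformed byte already resolves
this pair, whereas the raw bytes would need a full comparison past the
shared "http://www." prefix.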

[1] http://www.unicode.org/reports/tr10/#Notation
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
