On Sat, Oct 18, 2014 at 5:27 AM, Greg Stark <st...@mit.edu> wrote: > That's interesting but I think it's mostly a quirk of your example. > Afaics the difference is only that the en_US locale ignores > punctuation like : and / (or at least treats them as less significant > than alphabetic characters). If you had strings that had less > punctuation or differences that didn't happen to arrive shortly after > the 8-byte boundary then it wouldn't make any difference.
The en_US locale treats those kind of punctuation characters as less significant than alphabetical characters, but is not at all unusual in doing so - all locales I tested do the same. They also do the same for spaces. And there is another property of the transformation that is very important for those outside of the English speaking world - diacritics are not represented at the primary weight level. So even though representing certain characters with a diacritic will take 2 bytes in UTF-8, the corresponding primary weight only takes 1 byte. In short, as I said, the concentration of entropy can easily be a lot higher within the first n bytes of the primary weight level of the transformed blob as compared to the first n bytes of the original string, and frequently *will* be. This will make worst cases for abbreviated keys significantly less common than they would otherwise be, while more generally improving performance. Of course, that might not always be as effective as we'd prefer, but that's something that you more or less have to live with in order to have competitive sort performance (even with the HyperLogLog amelioration, you still pay something for considering the technique). You can always contrive a case that puts things just out of reach, no matter how much entropy you manage to concentrate in the abbreviated key. Fortunately, I don't think those cases are all that representative of what people want from sort performance. > And we still have to run strfrm at least once, write out the whole > binary blob to memory somewhere and if it spills to disk we still have > to write and read much more data. I think recognizing cases where > equality is the only thing we're interested in and locale-sensitive > sorting isn't necessary and using a memcmp would be a clear win. Yeah. I was making a point that strictly concerned abbreviated keys as proposed in response to Feng's remarks. I am not 100% sure that the benefits of abbreviated keys with locale support aren't worth it here, and I say that in full agreement with Feng about locale-aware sorts not actually being necessary. It's clear that cache performance is essential to getting good sort performance, which strongly recommends abbreviated keys. When we consider the concentration of entropy with "ordinary" locale-aware abbreviated keys, as compared to abbreviated keys that just use the C locale artificially for this case, it's not clear that the concentration of entropy isn't reason enough to prefer locale aware abbreviated keys. Now, it might well be that paying for n strxfrm() operations, rather than doing n straight-up memcpy() operations isn't worth it. But I think it might well be worth it - particularly when you factor in the complexity avoided by not special-casing this. I'm not really sure. The general idea of abbreviated keys is almost old hat, to be honest. It just happens to be essential for competitive sort performance. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers