So, just to give a bit more weight to my argument that we should recognise that equivalent strings ought to be treated identically, I direct your attention to conformance requirement C9 of Unicode 3.0:
http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode_3_0_0 This "requires that canonically equivalent text be treated identically. See also C10 and several other places in The Unicode Standard." This page, "DETERMINISTIC SORTING", was also interesting: http://www.unicode.org/notes/tn9/ The same hack that we use in varstr_cmp() appears as psuedo-code, under "Forcing Deterministic Comparisons". I'm not sure in what general sense what *they'd* call a non-deterministic comparison (i.e. plain old strcoll()) is actually non-deterministic. Clearly a "non-deterministic comparison" is deterministic to the extent that given the same two binary strings and encoding and collation, the comparison function will always give the same answer. The article goes on to describe how we could bolt-on a strxfrm() type value to a string, to ensure "deterministic comparison" (i.e. identical behaviour to Postgres master) with pre-processing. Tom did think that this might still be worth it. I'd rather avoid it. The article goes on to say of so-called deterministic comparisons: "However, a deterministic comparison is generally not best practice. First, it has a certain performance cost in comparison, and a quite substantial impact on sort key size. (For example, [ICU] language-sensitive sort keys are generally about the size of the original string, so appending a copy of the original string generally doubles the size of the sort key.) A database using these sort keys can have substantially increased disk footprint and memory footprint, and consequently reduced performance." I note also that the The Single UNIX Specification, Version 2 says of strcoll(): "The strxfrm() and strcmp() functions should be used for sorting large lists". Why has it taken us this long to research this? I am aware that Greg Stark suggested it back in 2005, but I doubt that that was the first time. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers