On 11/21/2010 06:09 PM, Robert Haas wrote:
> I think that's fair. It actually doesn't seem like it should be that
> hard if we knew that the server encoding were UTF8 - it's just a big
> translation table somewhere, no?

No, it's far more complex. See for example
<http://unicode.org/reports/tr21/tr21-3.html>, which says:
There are a number of complications to case mappings that occur once
the repertoire of characters is expanded beyond ASCII.
* Because of the inclusion of certain composite characters for
compatibility, such as 01F1 "Ǳ" /capital dz/, there is a
third case, called /titlecase/, which is used where the first
letter of a word is to be capitalized (e.g. Titlecase, vs.
UPPERCASE, or lowercase).
o For example, the title case of the example character is
01F2 "ǲ" /capital d with small z/.
* Case mappings may produce strings of different length than the
original.
o For example, the German character 00DF "ß" /small letter
sharp s/ expands when uppercased to the sequence of two
characters "SS". This also occurs where there is no
precomposed character corresponding to a case mapping,
such as with 0149 "ŉ" /latin small letter n preceded by
apostrophe./
* Characters may also have different case mappings, depending on
the context.
o For example, 03A3 "Σ" /capital sigma/ lowercases to 03C3
"σ" /small sigma/ if it is followed by another letter,
but lowercases to 03C2 "ς" /small final sigma/ if it is not.
* Characters may have case mappings that depend on the locale.
o For example, in Turkish the letter 0049 "I" /capital
letter i/ lowercases to 0131 "ı" /small dotless i/.
* Case mappings are not, in general, reversible.
o For example, once the string "McGowan" has been
uppercased, lowercased or titlecased, the original
cannot be recovered by applying another uppercase,
lowercase, or titlecase operation.
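
To make the points above concrete, here is a minimal sketch using Python's
built-in string methods. Python is chosen purely for illustration, the
variable names are made up for this example, and none of this is PostgreSQL
code. It assumes a Python 3 interpreter whose str methods apply the
locale-independent Unicode full case mappings (3.3 or later, as far as I
know).

```python
# Illustrative sketch only: Python's built-in str methods, not PostgreSQL code.
# All variable names here are hypothetical and exist just for this example.

# Titlecase is a third case: U+01F1 titlecases to U+01F2, not to itself.
dz_title = "\u01f1".title()

# Case mapping can change the length of a string.
sharp_s_upper = "\u00df".upper()        # U+00DF, small letter sharp s
n_apostrophe_upper = "\u0149".upper()   # U+0149, n preceded by apostrophe

# Context matters: capital sigma (U+03A3) lowercases one way at the start
# of a word and another way at the end of it.
sigma_word_lower = "\u03a3\u0391\u03a3".lower()

# Locale matters: str.lower() applies no Turkish tailoring, so "I" becomes
# a dotted "i" here rather than dotless i (U+0131).
untailored_i = "I".lower()

# Case mapping is lossy: the interior capital of "McGowan" cannot be
# recovered by applying another case operation.
round_trip = "McGowan".upper().title()

# ascii() keeps the output readable on terminals without Unicode support.
for name, value in [
    ("dz_title", dz_title),
    ("sharp_s_upper", sharp_s_upper),
    ("n_apostrophe_upper", n_apostrophe_upper),
    ("sigma_word_lower", sigma_word_lower),
    ("untailored_i", untailored_i),
    ("round_trip", round_trip),
]:
    print(f"{name}: {ascii(value)} (length {len(value)})")
```

A single character-to-character translation table cannot express the
one-to-many, context-dependent and locale-dependent cases, which is why this
is more than a lookup.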
cheers
andrew