Re: [HACKERS] multibyte-character aware support for function "downcase_truncate_identifier()"

Andrew Dunstan Sun, 21 Nov 2010 15:23:02 -0800


On 11/21/2010 06:09 PM, Robert Haas wrote:

   I think that's fair.  It actually doesn't seem like it should be that
   hard if we knew that the server encoding were UTF8 - it's just a big
   translation table somewhere, no?

No, it's far more complex. See for example <http://unicode.org/reports/tr21/tr21-3.html>, which says:


   There are a number of complications to case mappings that occur once
   the repertoire of characters is expanded beyond ASCII.

       * Because of the inclusion of certain composite characters for
         compatibility, such as 01F1 "DZ" /capital dz/, there is a
         third case, called /titlecase/, which is used where the first
         letter of a word is to be capitalized (e.g. Titlecase, vs.
         UPPERCASE, or lowercase).
             o For example, the title case of the example character is
               01F2 "Dz" /capital d with small z/.
       * Case mappings may produce strings of different length than the
         original.
             o For example, the German character 00DF "ß" /small letter
               sharp s/ expands when uppercased to the sequence of two
               characters "SS". This also occurs where there is no
               precomposed character corresponding to a case mapping,
               such as with 0149 "ŉ" /latin small letter n preceded by
               apostrophe./
       * Characters may also have different case mappings, depending on
         the context.
             o For example, 03A3 "Σ" /capital sigma/ lowercases to 03C3
               "σ" /small sigma/ if it is followed by another letter,
               but lowercases to 03C2 "ς" /small final sigma/ if it is not.
       * Characters may have case mappings that depend on the locale.
             o For example, in Turkish the letter 0049 "I" /capital
               letter i/ lowercases to 0131 "ı" /small dotless i/.
       * Case mappings are not, in general, reversible.
             o For example, once the string "McGowan" has been
               uppercased, lowercased or titlecased, the original
               cannot be recovered by applying another uppercase,
               lowercase, or titlecase operation.
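
To make those points concrete, here is a small sketch of my own (not from the report, and not PostgreSQL code). It uses Python's built-in str methods, which apply the Unicode default, locale-independent case mappings; the variable names are made up for illustration.

```python
# Each section mirrors one bullet from the quoted report.

# 1. Titlecase is a third case: U+01F1 (DZ), U+01F2 (Dz), U+01F3 (dz)
#    are three distinct single code points.
dz_upper, dz_title, dz_lower = "\u01F1", "\u01F2", "\u01F3"
titlecased = dz_lower.title()      # yields U+01F2, not U+01F1
uppercased = dz_lower.upper()      # yields U+01F1

# 2. Case mapping can change the length of a string.
sharp_s_upper = "\u00DF".upper()   # sharp s becomes the two letters "SS"
n_apos_upper = "\u0149".upper()    # no precomposed capital: U+02BC + "N"

# 3. Context matters: capital sigma lowercases differently at the end
#    of a word than at the start or in the middle.
word_final = "\u039F\u0394\u039F\u03A3".lower()   # last char is U+03C2
word_initial = "\u03A3\u039F".lower()             # first char is U+03C3

# 4. Locale matters: the default mapping gives "i", whereas Turkish
#    expects U+0131 (dotless i). str.lower() does not consult the locale.
default_i = "I".lower()

# 5. Case mapping is not reversible.
round_trip = "McGowan".upper().lower()

if __name__ == "__main__":
    print(len("\u00DF"), len(sharp_s_upper))
    print(word_final, word_initial)
    print(round_trip)
```

None of this fits a one-to-one, context-free table indexed by code point, which is the point the report is making.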


cheers

andrew
