Re: [HACKERS] Patch for collation using ICU

John Hansen Sat, 07 May 2005 07:12:33 -0700

Bruce Momjian wrote:
> 
> There are two reasons for that optimization --- first, some 
> locale support is broken and Unicode encoding with a C locale 
> crashes (not an issue for ICU), and second, it is an 
> optimization for languages like Japanese that want to use 
> unicode, but don't need a locale because upper/lower means 
> nothing in those character sets.


No: upper/lower means nothing in those languages, so why would you need
to optimize upper/lower if they're not used?
And if they are used, it's obviously because the text contains characters
from other languages (probably English), and as such they should behave
correctly.

Did I mention that for Japanese and the like, ICU would also offer
transliteration...

> 
> So, the first issue doesn't apply for ICU, and the second 
> might not depending on what characters you are using in the 
> Unicode character set.
> 
> I guess I am little confused how ICU can do upper() when the 
> locale is C.  What is it using to determine A is upper for a? 
>  Am I confused?

Simple: Unicode basically consists of a table of characters
(http://www.unicode.org/Public/UNIDATA/UnicodeData.txt).

Excerpt:

0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
...
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041

From this you can see that for 0041, which is capital letter A, there
is a mapping to its lowercase counterpart, 0061.
Likewise, there is a mapping for 0061 which says its uppercase
counterpart is 0041.
There is also SpecialCasing.txt, which covers those mappings that aren't
1-1, such as the German sharp s (ß), which uppercases to SS.
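
As a rough illustration of that table lookup (not how ICU itself is
implemented), the two excerpt lines can be parsed in a few lines of
Python. The field positions follow the UnicodeData.txt layout, where the
13th, 14th and 15th semicolon-separated fields hold the simple
uppercase, lowercase and titlecase mappings. The helper names
parse_case_mappings, simple_upper and simple_lower are made up for this
sketch.

```python
# The two UnicodeData.txt lines quoted in the excerpt, verbatim.
EXCERPT = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
"""

# Zero-based positions of the simple case mapping fields.
UPPER, LOWER, TITLE = 12, 13, 14


def _cp(field):
    """Turn a hex code point field into an int, or None if it is empty."""
    return int(field, 16) if field else None


def parse_case_mappings(text):
    """Build {code point: {'upper', 'lower', 'title'}} from UnicodeData lines."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split(";")
        table[int(fields[0], 16)] = {
            "upper": _cp(fields[UPPER]),
            "lower": _cp(fields[LOWER]),
            "title": _cp(fields[TITLE]),
        }
    return table


def simple_upper(table, ch):
    """Return the mapped uppercase character, or ch itself if none is listed."""
    entry = table.get(ord(ch))
    if entry and entry["upper"] is not None:
        return chr(entry["upper"])
    return ch


def simple_lower(table, ch):
    """Return the mapped lowercase character, or ch itself if none is listed."""
    entry = table.get(ord(ch))
    if entry and entry["lower"] is not None:
        return chr(entry["lower"])
    return ch


MAPPINGS = parse_case_mappings(EXCERPT)
print(simple_upper(MAPPINGS, "a"))  # A
print(simple_lower(MAPPINGS, "A"))  # a
```

No locale is consulted anywhere in the lookup; the answer comes purely
from the table.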

These mappings are fixed and independent of locale; only a few cases from
SpecialCasing.txt depend on locale/context.
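
The same behaviour can be observed in any runtime that takes its case
mappings from the Unicode data rather than from the C library. The
sketch below uses Python's built-in str.upper() as a stand-in for ICU,
purely for illustration: it sets the process locale to C and still gets
correct results, including the one-to-many mapping for the German sharp
s. The function name upper_under_c_locale is made up for this example.

```python
import locale


def upper_under_c_locale(text):
    """Uppercase text after forcing the process locale to C."""
    # The "C" locale is always available, so this call should not fail.
    locale.setlocale(locale.LC_ALL, "C")
    # str.upper() uses Python's bundled Unicode tables, not the C locale.
    return text.upper()


print(upper_under_c_locale("a"))       # A
print(upper_under_c_locale("straße"))  # STRASSE
```

The locale-sensitive entries (Turkish dotless i, for example) are the
exception: Python applies only the default mapping there, so "i" always
becomes "I".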


