[power-pro] Re: Unicode bugs? now regex issues

Sheri Mon, 17 Aug 2009 15:34:43 -0700

I did some evaluation of the CaseFolding.txt file.

First, the unicode.dll text case routines must not be handling the exceptions.

If they were, this:

local ustring=unicode.new("Maße")
unicode.messagebox("OK", ustring.change_case("upper"), "upper case")

would display "MASSE"

Instead we get "MAßE"

I think this is preferable anyway for the regex bit.
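
For comparison, here is a rough sketch of the two behaviors in Python rather than PowerPro, since Python's built-in upper() does apply the full mapping. The helper name simple_upper is made up for illustration, and it only approximates a one-to-one mapping by refusing any expansion; it is not how unicode.dll works internally.

```python
def simple_upper(text):
    """Uppercase one character at a time, keeping the original character
    whenever its uppercase form would expand to more than one character.
    (Hypothetical helper; an approximation of one-to-one case mapping.)"""
    out = []
    for ch in text:
        up = ch.upper()
        out.append(up if len(up) == 1 else ch)
    return "".join(out)

word = "Maße"
print(word.upper())        # full mapping: the sharp s expands to "SS"
print(simple_upper(word))  # one-to-one mapping: the sharp s is left alone
```

The one-to-one version never changes the number of characters, which is the property that matters when matched offsets have to line up with the original string.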

Second, the unicode plugin is not handling all of the numeric values given 
in the Unicode tables. It apparently should go much higher than 65280 (0xFF00).

For example:
local test=unicode.from_num(0x10400)
;;error, but should be DESERET CAPITAL LETTER LONG I
unicode.messagebox("ok", test)
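
To show what a value that high involves, here is a small Python sketch (not PowerPro). It assumes the plugin stores text as UTF-16, which is only a guess on my part; in UTF-16 anything above 0xFFFF has to be written as a surrogate pair. The helper name utf16_units is made up for illustration.

```python
import unicodedata

def utf16_units(cp):
    """Return the UTF-16 code units for one code point.
    (Hypothetical helper; values above 0xFFFF become a surrogate pair.)"""
    if cp < 0x10000:
        return [cp]
    offset = cp - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]

cp = 0x10400
print(unicodedata.name(chr(cp)))               # character name from the Unicode tables
print([f"{u:04X}" for u in utf16_units(cp)])   # the two UTF-16 code units
print(len(chr(cp).encode("utf-8")))            # number of bytes in UTF-8
```

If the plugin treats each 16-bit unit as a whole character, that would explain an error for any value past 0xFFFF.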

Finally, FWIW, even using only the "C" and "S" status types in CaseFolding.txt 
(the non-exceptions), there are a few entries where the byte size seems to vary 
between the upper and lower case variants of the same letter expressed in utf-8:

unequal utf8 lengths:2 vs 1 ;; 017F 0073 ſ s
unequal utf8 lengths:2 vs 3 ;; 023A 2C65 Ⱥ ⱥ
unequal utf8 lengths:2 vs 3 ;; 023E 2C66 Ⱦ ⱦ
unequal utf8 lengths:3 vs 2 ;; 1E9E 00DF ẞ ß
unequal utf8 lengths:3 vs 2 ;; 1FBE 03B9 ι ι
unequal utf8 lengths:3 vs 2 ;; 2126 03C9 Ω ω
unequal utf8 lengths:3 vs 1 ;; 212A 006B K k
unequal utf8 lengths:3 vs 2 ;; 212B 00E5 Å å
unequal utf8 lengths:3 vs 2 ;; 2C62 026B Ɫ ɫ
unequal utf8 lengths:3 vs 2 ;; 2C64 027D Ɽ ɽ
unequal utf8 lengths:3 vs 2 ;; 2C6D 0251 Ɑ ɑ
unequal utf8 lengths:3 vs 2 ;; 2C6E 0271 Ɱ ɱ
unequal utf8 lengths:3 vs 2 ;; 2C6F 0250 Ɐ ɐ
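
The byte counts in the list above can be rechecked with a short Python sketch (not the script I actually used). The code point pairs are copied from the list; the names PAIRS, utf8_len and report are made up for illustration.

```python
# (source, folded) code point pairs copied from the list above
PAIRS = [
    (0x017F, 0x0073), (0x023A, 0x2C65), (0x023E, 0x2C66),
    (0x1E9E, 0x00DF), (0x1FBE, 0x03B9), (0x2126, 0x03C9),
    (0x212A, 0x006B), (0x212B, 0x00E5), (0x2C62, 0x026B),
    (0x2C64, 0x027D), (0x2C6D, 0x0251), (0x2C6E, 0x0271),
    (0x2C6F, 0x0250),
]

def utf8_len(cp):
    """Number of bytes needed to encode one code point in UTF-8."""
    return len(chr(cp).encode("utf-8"))

def report(pairs):
    """Return one line per pair whose two UTF-8 encodings differ in length."""
    lines = []
    for src, folded in pairs:
        a, b = utf8_len(src), utf8_len(folded)
        if a != b:
            lines.append(f"unequal utf8 lengths:{a} vs {b} ;; {src:04X} {folded:04X}")
    return lines

for line in report(PAIRS):
    print(line)
```

Any caseless comparison that works on UTF-8 bytes cannot assume that a folded string keeps the same byte length as the original.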

Regards,
Sheri
