I did some evaluation of the CaseFolding.txt file.
First, the unicode.dll text case routines must not be handling the exceptions.
If they were, this:
local ustring=unicode.new("Maße")
unicode.messagebox("OK", ustring.change_case("upper"), "upper case")
would display "MASSE"
Instead we get "MAßE"
I think this is preferable anyway for the regex bit.
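For comparison, here is a rough Python sketch of the two behaviours (this is not the plugin's code, and simple_upper is a hypothetical helper that only approximates one-to-one case mapping by skipping any character whose uppercase form expands to more than one character):

```python
def simple_upper(text):
    """Uppercase using only one-to-one mappings, leaving e.g. ß untouched."""
    out = []
    for ch in text:
        up = ch.upper()
        # Keep the original character when the full mapping would expand it.
        out.append(up if len(up) == 1 else ch)
    return "".join(out)


word = "Maße"
full = word.upper()          # Python applies the full mapping, so ß expands
simple = simple_upper(word)  # one-to-one only, string length is preserved
print(full, simple)
```

The one-to-one version keeps the character count unchanged, which is probably why it is the easier behaviour to live with for the regex bit.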
Second, the unicode plugin is not handling all the possible numeric values
given in the Unicode tables. It apparently should go much higher than 65280
(0xFF00); Unicode code points run up to 0x10FFFF.
For example:
local test=unicode.from_num(0x10400)
;;error, but should be DESERET CAPITAL LETTER LONG I
unicode.messagebox("ok", test)
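As a cross-check outside the plugin, a short Python sketch confirms what 0x10400 is and how it is encoded. The guess that the plugin stores text as UTF-16 is only an assumption on my part; if it does, anything above 0xFFFF needs a surrogate pair, which utf16_units (a hypothetical helper) computes:

```python
import unicodedata


def utf16_units(code_point):
    """Return the UTF-16 code units for a code point (surrogate pair above 0xFFFF)."""
    if code_point <= 0xFFFF:
        return [code_point]
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return [high, low]


ch = chr(0x10400)
name = unicodedata.name(ch)
units = utf16_units(0x10400)
utf8_bytes = ch.encode("utf-8")
print(name, [hex(u) for u in units], utf8_bytes.hex())
```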
Finally, FWIW, even using only the "C" and "S" status types in CaseFolding.txt
(the non-exceptions), there are a few entries where the byte size seems to vary
between the upper and lower case variants of the same letter expressed in utf-8:
unequal utf8 lengths:2 vs 1 ;; 017F 0073 ſ s
unequal utf8 lengths:2 vs 3 ;; 023A 2C65 Ⱥ ⱥ
unequal utf8 lengths:2 vs 3 ;; 023E 2C66 Ⱦ ⱦ
unequal utf8 lengths:3 vs 2 ;; 1E9E 00DF ẞ ß
unequal utf8 lengths:3 vs 2 ;; 1FBE 03B9 ι ι
unequal utf8 lengths:3 vs 2 ;; 2126 03C9 Ω ω
unequal utf8 lengths:3 vs 1 ;; 212A 006B K k
unequal utf8 lengths:3 vs 2 ;; 212B 00E5 Å å
unequal utf8 lengths:3 vs 2 ;; 2C62 026B Ɫ ɫ
unequal utf8 lengths:3 vs 2 ;; 2C64 027D Ɽ ɽ
unequal utf8 lengths:3 vs 2 ;; 2C6D 0251 Ɑ ɑ
unequal utf8 lengths:3 vs 2 ;; 2C6E 0271 Ɱ ɱ
unequal utf8 lengths:3 vs 2 ;; 2C6F 0250 Ɐ ɐ
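A check like the one above can be sketched in Python roughly as follows. This is not the script I actually ran, just a minimal version assuming the usual "code; status; mapping; # name" layout of CaseFolding.txt; unequal_utf8_lengths and the sample lines are made up for illustration:

```python
def unequal_utf8_lengths(lines, statuses=("C", "S")):
    """Return (code, mapping, len1, len2) for entries whose UTF-8 lengths differ."""
    results = []
    for line in lines:
        data = line.split("#", 1)[0].strip()   # drop the trailing comment
        if not data:
            continue                           # blank or comment-only line
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 3 or fields[1] not in statuses:
            continue                           # skip other status types
        source = chr(int(fields[0], 16))
        # A mapping may list several code points separated by spaces.
        target = "".join(chr(int(cp, 16)) for cp in fields[2].split())
        a = len(source.encode("utf-8"))
        b = len(target.encode("utf-8"))
        if a != b:
            results.append((fields[0], fields[2], a, b))
    return results


sample = [
    "# comment line",
    "0041; C; 0061; # LATIN CAPITAL LETTER A",
    "00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S",
    "017F; C; 0073; # LATIN SMALL LETTER LONG S",
    "212A; C; 006B; # KELVIN SIGN",
]
mismatches = unequal_utf8_lengths(sample)
for code, mapping, a, b in mismatches:
    print(f"unequal utf8 lengths:{a} vs {b} ;; {code} {mapping}")
```

To run it on the real file, pass the open file object instead of sample, for example unequal_utf8_lengths(open("CaseFolding.txt", encoding="utf-8")).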
Regards,
Sheri