--On Saturday, May 07, 2005 09.52.59 -0400 Bruce Momjian <pgman@candle.pha.pa.us> wrote:
Palle Girgensohn wrote:
>> Also, apparently, ICU is installed by default in many linux
>> distributions, and usually it is version 2.8. Some linux users have
>> asked me if there are plans for a patch that works with ICU 2.8.
>> That's probably a good idea. IBM and the ICU folks seem to consider
>> 3.2 to be the stable version, older versions are hard to find on
>> their sites, but most linux distributers seem to consider it too
>> bleeding edge, even gentoo. I don't know why they don't agree.
>
> Good point. Why would linux folks need ICU? Doesn't their OS support
> encodings natively? I am particularly excited about this for OSs that
> don't have such encodings, like UTF8 support for Win32.
>
> Because ICU will not be used unless enabled by configure, it seems we
> are fine with only supporting the newest version. Do Linux users need
> to use ICU for any reason?
There are corner cases where it is impossible to upper/lowercase one character at a time. For example:
-- without ICU
select upper('Eßer');
 upper
-------
 EßER
(1 row)

-- with ICU
select upper('Eßer');
 upper
-------
 ESSER
(1 row)
This is because in the standard postgres implementation, upper/lower is done one character at a time. A proper upper/lower cannot do it that way. Another known example is Turkish, where an ? (?) should look different depending on whether it is an initial letter or not. This fails in standard postgresql on all platforms.
Uh, where do you see that? Our code has:
workspace = texttowcs(string);
for (i = 0; workspace[i] != 0; i++) workspace[i] = towupper(workspace[i]);
As you see, the loop runs towupper on one character at a time. It cannot consider whether the letter is the initial, as required in Turkish, and it cannot really convert one character into two ('ß' -> 'SS').
result = wcstotext(workspace, i);
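To make the length problem concrete, here is a minimal sketch in plain C, not the actual backend code. upper_per_char mirrors the loop quoted above; upper_with_expansion is a hypothetical helper I made up for illustration that hard-codes the single 'ß' -> 'SS' case. A real implementation would get such rules from a library like ICU rather than from a special case.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/* Per-character uppercasing, in the style of the quoted loop.
 * Every wchar_t is replaced by exactly one wchar_t, so the string
 * length can never change. Returns the (unchanged) length. */
static size_t upper_per_char(wchar_t *s)
{
    size_t i;

    assert(s != NULL);
    for (i = 0; s[i] != 0; i++)
        s[i] = (wchar_t) towupper((wint_t) s[i]);
    return i;
}

/* Hypothetical string-level uppercasing that knows one special case:
 * U+00DF (sharp s) expands to "SS". Returns a malloc'd string that the
 * caller must free, or NULL if allocation fails. */
static wchar_t *upper_with_expansion(const wchar_t *s)
{
    size_t len, i, j = 0;
    wchar_t *out;

    assert(s != NULL);
    len = wcslen(s);
    /* Worst case: every input character expands to two. */
    out = malloc((2 * len + 1) * sizeof(wchar_t));
    if (out == NULL)
        return NULL;

    for (i = 0; i < len; i++)
    {
        if (s[i] == 0x00DF)
        {
            out[j++] = L'S';
            out[j++] = L'S';
        }
        else
            out[j++] = (wchar_t) towupper((wint_t) s[i]);
    }
    out[j] = 0;
    return out;
}
```

Calling upper_per_char on L"E\u00DFer" always gives back a four-character string, whatever the locale does with the 'ß', while upper_with_expansion returns the five-character L"ESSER". The output buffer has to be sized for the expansion up front, which is where the extra allocation code comes from.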
>> Also, in the latest patch, I also added checks and logging for *every*
>> status returned from ICU. I hope this will help debugging on debian,
>> where previous version didn't work. That excessive status checking is
>> hardly be necessary once the stuff is better tested.
>>
>> I think the string copying and heap/palloc choices stands for most of
>> the code bloat, together with the excessive status checking and
>> logging.
>
> OK, move that into some common functions and I think it will be better.
Best way for upper/lower/initcap is probably to use a function pointer... uhh...
Uh, I don't think so. Just send pointers to the function and let the function allocate the memory, and another function to free them, or something like that. I can probably do it if you want.
I'll check it out, it seems simple enough.
> We have deprecated UNICODE in 8.1 in favor of UTF8 (no dash). Does
> that help?
I'm aware of that. It might help for unicode, but there are a bunch of other encodings. IANA has decided that utf-8 has *no* aliases, hence only utf-8 (with dash, but case insensitive) is accepted. Perhaps ICU is forgiving, I don't remember/know, but I think we need the mappings, unfortunately.
OK. I guess I am just confused why the native implementations are OK.
They're OK since they understand that UNICODE (or UTF8) is really utf-8. Problem is the strings used to describe them are not understood by ICU.
BTW, the pg_enc2iananame_tbl is only used *from* internal representation *to* IANA, not the other way around. Maybe that fact lowers the rate of confusion? ;-)
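Roughly, the one-way mapping looks like the sketch below. This is an illustration only, not the table from the patch: the struct, the table name enc2iana_sketch and the function pg_to_iana are made up here, and only a few well-known name pairs are listed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One entry: a PostgreSQL encoding name and the IANA name for it. */
typedef struct
{
    const char *pg_name;
    const char *iana_name;
} enc_map;

/* Hypothetical sample table; the real one would cover every encoding. */
static const enc_map enc2iana_sketch[] = {
    {"UTF8", "UTF-8"},
    {"UNICODE", "UTF-8"},       /* old alias, same IANA name */
    {"LATIN1", "ISO-8859-1"},
    {"EUC_JP", "EUC-JP"},
};

/* Look up the IANA name for a PostgreSQL encoding name.
 * The lookup goes one way only; returns NULL if there is no entry. */
static const char *pg_to_iana(const char *pg_name)
{
    size_t i;
    size_t n = sizeof(enc2iana_sketch) / sizeof(enc2iana_sketch[0]);

    assert(pg_name != NULL);
    for (i = 0; i < n; i++)
    {
        if (strcmp(enc2iana_sketch[i].pg_name, pg_name) == 0)
            return enc2iana_sketch[i].iana_name;
    }
    return NULL;
}
```

Since the lookup is keyed on the internal name only, two internal names can share one IANA name without any ambiguity, and nothing ever has to parse an IANA name back into an internal one.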
/Palle