Re: [HACKERS] Patch for collation using ICU

Palle Girgensohn Sat, 07 May 2005 07:15:47 -0700

--On Saturday, May 07, 2005 09:52:59 -0400 Bruce Momjian <pgman@candle.pha.pa.us> wrote:

Palle Girgensohn wrote:

>> Also, apparently, ICU is installed by default in many Linux
>> distributions, and usually it is version 2.8. Some Linux users have
>> asked me if there are plans for a patch that works with ICU 2.8.
>> That's probably a good idea. IBM and the ICU folks seem to consider
>> 3.2 to be the stable version, and older versions are hard to find on
>> their sites, but most Linux distributors seem to consider it too
>> bleeding edge, even Gentoo. I don't know why they don't agree.
>
> Good point.  Why would linux folks need ICU?  Doesn't their OS support
> encodings natively?  I am particularly excited about this for OSs that
> don't have such encodings, like UTF8 support for Win32.
>
> Because ICU will not be used unless enabled by configure, it seems we
> are fine with only supporting the newest version.  Do Linux users need
> to use ICU for any reason?


There are corner cases where it is impossible to upper/lowercase one
character at a time. For example:

-- without ICU
select upper('Eßer');
 upper
-------
 EßER
(1 row)

-- with ICU
select upper('Eßer');
 upper
-------
 ESSER
(1 row)

This is because in the standard postgres implementation, upper/lower is
done one character at a time. A proper upper/lower cannot do it that
way. Another known example is in Turkish, where an ? (?) should look
different depending on whether it is an initial letter or not. This
fails in standard postgresql on all platforms.


Uh, where do you see that?  Our code has:

        workspace = texttowcs(string);

        for (i = 0; workspace[i] != 0; i++)
            workspace[i] = towupper(workspace[i]);

As you see, the loop runs towupper on one character at a time. It cannot
consider whether the letter is the initial, as required in Turkish, and
it cannot really convert one character into two ('ß' -> 'SS').


        result = wcstotext(workspace, i);
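
To make the limitation concrete, here is a minimal sketch in plain C. It is not the ICU patch and not the PostgreSQL code: upper_per_char and upper_whole_string are hypothetical names, and the single hard-coded rule (U+00DF to "SS") only stands in for the full case-mapping rules that ICU provides. It shows that an in-place per-character loop can never change the string length, while a string-level routine writing into a separate buffer can.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Per-character uppercasing, in the style of the loop quoted above.
 * The result always has exactly as many characters as the input. */
static void upper_per_char(wchar_t *s)
{
    for (size_t i = 0; s[i] != 0; i++)
        s[i] = (wchar_t) towupper((wint_t) s[i]);
}

/* String-level uppercasing with one hard-coded special case
 * (U+00DF, sharp s, becomes "SS"). Hypothetical helper for illustration;
 * a real implementation would take such rules from ICU.
 * Returns the number of characters written, not counting the
 * terminator, or (size_t) -1 if dst is too small. */
static size_t upper_whole_string(const wchar_t *src, wchar_t *dst, size_t dstlen)
{
    size_t out = 0;

    assert(src != NULL && dst != NULL);
    if (dstlen == 0)
        return (size_t) -1;

    for (size_t i = 0; src[i] != 0; i++)
    {
        size_t need = (src[i] == 0x00DF) ? 2 : 1;

        /* Keep one slot free for the terminator. */
        if (out + need >= dstlen)
            return (size_t) -1;

        if (src[i] == 0x00DF)
        {
            dst[out++] = L'S';
            dst[out++] = L'S';
        }
        else
            dst[out++] = (wchar_t) towupper((wint_t) src[i]);
    }
    dst[out] = 0;
    return out;
}
```

Given the four-character input E, U+00DF, e, r, the first function still yields four characters, while the second writes the five characters ESSER into a large enough buffer.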

>> Also, in the latest patch, I added checks and logging for *every*
>> status returned from ICU. I hope this will help debugging on Debian,
>> where the previous version didn't work. That excessive status
>> checking will hardly be necessary once the stuff is better tested.
>>
>> I think the string copying and heap/palloc choices account for most
>> of the code bloat, together with the excessive status checking and
>> logging.
>
> OK, move that into some common functions and I think it will be better.

Best way for upper/lower/initcap is probably to use a function
pointer...  uhh...


Uh, I don't think so.  Just send pointers to the function and let
the function allocate the memory, and another function to free them, or
something like that.  I can probably do it if you want.
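
One way to read that suggestion, as a rough sketch only: the caller passes the address of a pointer, the helper allocates and fills the buffer, and a matching helper releases it. The names conv_alloc_copy and conv_free are hypothetical, and malloc/free stand in for whatever allocator the patch would really use (palloc or the heap); nothing here is taken from the actual patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy a length-counted string into a freshly
 * allocated, NUL-terminated buffer. The callee decides the size and
 * hands the buffer back through *out, so callers do not repeat the
 * allocation logic. Returns 0 on success, -1 on failure. */
static int conv_alloc_copy(const char *src, size_t len, char **out)
{
    char *buf;

    assert(out != NULL);
    if (src == NULL)
        return -1;

    buf = malloc(len + 1);
    if (buf == NULL)
        return -1;

    memcpy(buf, src, len);
    buf[len] = '\0';
    *out = buf;
    return 0;
}

/* Matching release function: frees the buffer and clears the caller's
 * pointer so it cannot be used again by accident. */
static void conv_free(char **buf)
{
    assert(buf != NULL);
    free(*buf);
    *buf = NULL;
}
```

With helpers of this shape, upper, lower and initcap would each call the allocate function once, do their conversion, and call the free function once, which keeps the copying code in one place.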


I'll check it out, it seems simple enough.

> We have deprecated UNICODE in 8.1 in favor of UTF8 (no dash).  Does
> that help?

I'm aware of that. It might help for unicode, but there are a bunch of
other encodings. IANA has decided that utf-8 has *no* aliases, hence
only utf-8 (with dash, but case insensitive) is accepted. Perhaps ICU is
forgiving, I don't remember/know, but I think we need the mappings,
unfortunately.

OK. I guess I am just confused why the native implementations are OK.

They're OK since they understand that UNICODE (or UTF8) is really utf-8. The problem is that the strings used to describe them are not understood by ICU.

BTW, the pg_enc2iananame_tbl is only used *from* internal representation *to* IANA, not the other way around. Maybe that fact lowers the rate of confusion? ;-)
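
A one-way table of that kind could look roughly like the sketch below. This is not the layout of pg_enc2iananame_tbl from the patch: the enum, the table name enc2iana and the function enc_to_iana are all hypothetical, and only three encodings are listed as examples.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical internal encoding identifiers, for illustration only. */
typedef enum
{
    ENC_UTF8,
    ENC_LATIN1,
    ENC_EUC_JP,
    ENC_COUNT
} enc_id;

/* Internal identifier -> IANA charset name. The lookup goes in this
 * direction only; nothing maps an IANA name back to an identifier. */
static const char *const enc2iana[ENC_COUNT] = {
    "utf-8",        /* ENC_UTF8 */
    "iso-8859-1",   /* ENC_LATIN1 */
    "euc-jp"        /* ENC_EUC_JP */
};

/* Returns the IANA name for enc, or NULL if enc is out of range. */
static const char *enc_to_iana(int enc)
{
    if (enc < 0 || enc >= ENC_COUNT)
        return NULL;
    assert(enc2iana[enc] != NULL);
    return enc2iana[enc];
}
```

Because the table is only consulted when handing an encoding name to ICU, whatever spellings the server accepts on input (UNICODE, UTF8) never need to appear in it.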

/Palle


