Re: [HACKERS] Patch for collation using ICU

Palle Girgensohn Fri, 25 Mar 2005 18:13:07 -0800

--On Friday, March 25, 2005 00.40.04 +0100 Palle Girgensohn <[EMAIL PROTECTED]> wrote:

Hi!

I've put together a patch for using IBM's ICU package for collation.

If your OS does not have full support for collation or
uppercase/lowercase in multibyte locales, this might be useful. If you
are using a multibyte character encoding in your database and want
collation, i.e. order by, and also lower(), upper() and initcap() to work
properly, this patch will do just that.

This patch is needed for FreeBSD, since this OS has no support for
collation in, for example, Unicode locales (that is, wcscoll(3) does not do
what you expect if you set LC_ALL=sv_SE.UTF-8). AFAIK the
patch is *not* necessary for Linux, although IBM claims ICU collation to
be about twice as fast as glibc for simple western locales.
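To make the wcscoll(3) point concrete, here is a minimal sketch of a locale-aware comparison. The helper name collate_in_locale is hypothetical and is not part of the patch; it only wraps the standard setlocale() and wcscoll() calls.

```c
#include <assert.h>
#include <locale.h>
#include <wchar.h>

/*
 * Hypothetical helper, not taken from the patch.
 * Switches LC_COLLATE to the named locale (this changes process-wide
 * state) and compares two wide strings with wcscoll().
 * Returns <0, 0 or >0 like wcscoll(). *locale_ok is set to 0 when the
 * locale is not installed; the previously active locale is then used.
 */
int collate_in_locale(const char *locale, const wchar_t *a,
                      const wchar_t *b, int *locale_ok)
{
    assert(locale != NULL && a != NULL && b != NULL && locale_ok != NULL);
    *locale_ok = (setlocale(LC_COLLATE, locale) != NULL);
    return wcscoll(a, b);
}
```

With "sv_SE.UTF-8", a working implementation should return a negative value for L"\u00e5" (å) against L"\u00e4" (ä), since Swedish sorts å before ä, while plain code point order gives the opposite sign. That pair is one way to check whether a platform's wcscoll() really applies the locale's rules.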

It adds a configure switch, `--with-icu', which will set up the code to
use ICU instead of wchar_t and wcscoll.

This has been tested only on FreeBSD-4.11 & FreeBSD-5-stable, where it
seems to run well. I've not had the time to do any comparative
performance tests yet, but it seems it is at least not slower than using
LATIN1 with sv_SE.ISO8859-1 locale, perhaps even faster.

I'd be delighted if some more experienced postgresql hackers would review
this stuff. The patch is pretty compact, so it's fast reading :)  I'm
planning to add this patch as an option (tagged "experimental") to
FreeBSD's postgresql port. Any ideas about whether this is a good idea or
not?

Any thoughts or ideas are welcome!

Cheers,
Palle

Patch at:
<http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2005-03-14.diff>

ICU at sourceforge: <http://icu.sf.net/>

Hi!

There's a new patch to fix some reported problems.

<http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2005-03-26.diff>

This version uses the DatabaseEncoding and sets the ICU encoding at the same time. I had to create a conversion table from PostgreSQL's own, somewhat odd and non-standard, encoding names to the preferred IANA names. One or two of the more odd ones might be slightly incorrect, hopefully not too far off anyway?
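A conversion table of this kind could look roughly like the sketch below. The struct, the function name pg_encoding_to_iana and the handful of entries are illustrative assumptions; they are not copied from the patch, whose actual table may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One row of a hypothetical PostgreSQL-name to IANA-name table. */
struct encoding_alias {
    const char *pg_name;    /* name as PostgreSQL 8.0 reports it */
    const char *iana_name;  /* preferred IANA/MIME charset name */
};

/* Illustrative entries only; a real table would cover every encoding. */
static const struct encoding_alias encoding_aliases[] = {
    {"UNICODE", "UTF-8"},
    {"LATIN1",  "ISO-8859-1"},
    {"LATIN2",  "ISO-8859-2"},
    {"EUC_JP",  "EUC-JP"},
};

/* Return the IANA name for a PostgreSQL encoding name, or NULL if unknown. */
const char *pg_encoding_to_iana(const char *pg_name)
{
    size_t i;
    size_t n = sizeof encoding_aliases / sizeof encoding_aliases[0];

    assert(pg_name != NULL);
    for (i = 0; i < n; i++) {
        if (strcmp(encoding_aliases[i].pg_name, pg_name) == 0)
            return encoding_aliases[i].iana_name;
    }
    return NULL;
}
```

Returning NULL for an unknown name lets the caller decide whether to fall back to non-ICU behaviour or report an error, rather than silently guessing a converter.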

I've noticed a couple of things about using the ICU patch vs. pristine pg-8.0.1:

- ORDER BY is case insensitive when using ICU. This might break the SQL standard (?), but sure is nice :)

- When the database is initialized using the C locale, upper() and lower() normally do not work at all for non-ASCII characters, even if the database's encoding is, say, LATIN1 or UNICODE. (They do not work for me anyway, on FreeBSD, and this is probably correct since the locale is still `C', I believe?) The ICU patch changes nothing for the LATIN1 case, since it does not act on single-byte encodings, but for the UNICODE representation it works and does what I expect it to: upper() and lower() neatly upper- or lowercase diacritical characters, e.g. lower() applied to uppercase accented letters returns their lowercase forms. This is a good thing, although I'm surprised that upper/lower is dragged along with the LC_COLLATE fixation at initdb. I never run initdb in the C locale, but only now do I realize how broken that really is if you need to store anything other than English :-)
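The C-locale behaviour described above can be reproduced at the libc level. This is a sketch using the standard toupper() call, not PostgreSQL's upper() implementation; the helper name upper_in_c_locale is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>

/*
 * Hypothetical helper: uppercase a single byte under the "C" locale.
 * Note that setlocale() changes process-wide state.
 */
int upper_in_c_locale(unsigned char byte)
{
    const char *active = setlocale(LC_CTYPE, "C");

    assert(active != NULL);  /* the "C" locale is always available */
    return toupper(byte);
}
```

In the "C" locale only the ASCII letters a-z have uppercase mappings, so a byte such as 0xE5 (å in LATIN1) comes back unchanged while 'a' becomes 'A'. That matches the observation that case conversion of non-ASCII characters depends on the locale chosen at initdb rather than on the database encoding alone.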

I'd be delighted to get more feedback about this stuff.

Thanks,
Palle

