Howdy,

[N.B.: I tried to send this a while ago but it didn't get delivered, I'm assuming because, with the uncompressed patch, the email was too big for -hackers. So this is a re-send with the patch gzip'd. Sorry for any duplication].

Please find attached a patch adding a locale-aware, case-insensitive text type, called citext, as a contrib module. A few notes:

* I had originally called it lctext, as it's not a true case- insensitive type, but just converts strings to lowercase before comparing them. I changed it to citext at Tom Lane's suggestion, to ease compatibility for users of the original citext module on pgFoundry.

* Differences from the original citext are:
 + Locale-aware lowercasing of strings, rather than just lowercasing
   ASCII characters.
 + No implicit casts from text to citext except via assignment.
 + A few more functions overloaded
 + Works with 8.3 and CVS head

* Many thanks to whoever added str_tolower() to formatting.c. If I had known about that, I could have saved myself a lot of grief! My original implementation for 8.3.1 had copied a lot of code from oracle_compat.c to get things working. With this patch, I've eliminated a whole lot of code, as I can now just call str_tolower(). So thank you for that! I'll probably keep my original in my personal Subversion repository, but don't now about releasing it if it will be accepted as a contrib module for 8.4.

* All comparisons simply convert the strings to be compared to lowercase using str_tolower(). I've made no other optimizations, though I'm sure someone with more experience with collations and such could add them.

* The regression test uses a new module I've created, now on pgFoundry, called pgtap. It should just work. sql/citext.sql adds plpgsql to the database and then includes pgtap.sql, which has the test functions in it.

* I wrote the tests assuming a collation of en_US.UTF-8. I expect it'd work with most West European languages, and maybe all languages other than the C locale, but I'm not sure. YMMV. If there's a way to generalize it and still be able to test the locale awareness, that would be great. What locales do the build farm servers use?

* In the documentation, I've pitched this type as a replacement for the use of LOWER() in ad-hoc queries, while also stipulating that this is not a "true" case-insensitive text type, and is furthermore less efficient than just TEXT (though I'm sure more efficient than ad-hock LOWER()s). I've also mentioned a few other caveats, including casts for TEXT that don't work for citext and non-case-insensitive matching in replace(), regexp_replace(), and a few others.

* I wrote all the code here myself, but of course used the original citext implementation (which is case-insensitive only for ASCII characters) for inspiration and guidance. Thanks to Donald Fraser for that original implementation.

I've compiled the CVS checkout, run its regressions, then built and installed the citext module (hence my discovery of the deprecation of wstring_lower and the addition of str_tolower -- should the declaration of the former be removed from formatting.c?), and all tests passed as of an hour ago.

I of course welcome feedback, advice, insults, commiserations, and just about any mode of comment on this patch. Please let me know if I need to provide any additional information.

Best,

David

Attachment: citext.patch.gz
Description: GNU Zip compressed data


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to