Re: RFC: Charset Conversion Routines

Alexey Melnikov Tue, 24 Feb 2009 03:14:09 -0800

Bron Gondwana wrote:

I'm in the process of rewriting lib/mkchartable.c and
lib/charset.c, with the eventual goal being a more
flexible charset conversion API that can be used to
make sieve rules match on the decoded values, and
other funky things.
It turns out to be quite a lot of changes.  My initial
work in progress is up here:

http://github.com/brong/cyrus-imapd/commit/863b5b51dd27f184fa00de4ec5a6aca3308fc30e

As you can see, it's quite a bit of code.


Anyway - I'd like some feedback on a couple of things:

a) It's going to use a little more CPU this way, because
  instead of having a table that converts _directly_ from
  the source charset to utf-8 in search-canonical-form,
  it does one conversion to unicode characters (16bit),
  then another table converts that into a stream of zero
  to 15 characters (yes, something expands to 15 separate
  codepoints, no, I don't want to know what it is!)
  Finally a third pass converts to utf-8 from the
  character codepoints.
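
To make the three passes in (a) concrete, here is a minimal sketch in C. It is not the code from the commit above: the function names, the ISO-8859-1 source charset and the toy canonicalisation rules (drop spaces, fold A-Z to lowercase, expand U+00DF to "ss") are all assumptions chosen to keep the example short. The real tables are generated by mkchartable and can expand one character to many more codepoints.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pass 1: decode one ISO-8859-1 byte to a codepoint (identity mapping). */
static uint32_t latin1_decode(unsigned char byte)
{
    return byte;
}

/* Pass 2: map one codepoint to zero or more canonical codepoints.
 * Hypothetical toy rules, NOT the Cyrus tables. Returns the count written. */
static size_t canon_map(uint32_t cp, uint32_t out[2])
{
    if (cp == ' ')
        return 0;                              /* expands to nothing */
    if (cp == 0xDF) {                          /* sharp s expands to "ss" */
        out[0] = 's';
        out[1] = 's';
        return 2;
    }
    if (cp >= 'A' && cp <= 'Z')
        cp += 'a' - 'A';                       /* ASCII case fold */
    out[0] = cp;
    return 1;
}

/* Pass 3: encode one 16-bit codepoint as UTF-8. Returns bytes written (1-3). */
static size_t utf8_encode16(uint32_t cp, unsigned char *out)
{
    assert(cp <= 0xFFFF);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Chain the three passes one input byte at a time and NUL-terminate dst.
 * Returns the output length, or (size_t)-1 if dst is too small. */
size_t latin1_to_search_utf8(const unsigned char *src, size_t len,
                             unsigned char *dst, size_t cap)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        uint32_t mapped[2];
        size_t count = canon_map(latin1_decode(src[i]), mapped);
        for (size_t j = 0; j < count; j++) {
            if (cap - n < 4)                   /* up to 3 bytes plus the NUL */
                return (size_t)-1;
            n += utf8_encode16(mapped[j], dst + n);
        }
    }
    if (n >= cap)
        return (size_t)-1;
    dst[n] = '\0';
    return n;
}
```

With these toy rules, the eight Latin-1 bytes of "Straße A" come out as the eight UTF-8 bytes "strassea". The extra CPU cost is the two additional per-character steps compared with a single direct table lookup.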

b) Should we make this 32bit unicode characters while we're
  at it, and extend the UTF-8 converter?

Yes!
And upgrade the tables to Unicode 5.1.0.
And also change the normalization to conform to RFC 5051.
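
As a rough illustration of what extending the UTF-8 converter to 32-bit characters involves, here is a sketch of an encoder covering the full range allowed by RFC 3629 (up to U+10FFFF, four bytes). The function name utf8_encode32 is made up for this example and is not taken from lib/charset.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode scalar value as UTF-8 (RFC 3629).
 * Returns the number of bytes written (1 to 4), or 0 if cp is a
 * surrogate (U+D800..U+DFFF) or above U+10FFFF.
 * out must have room for 4 bytes. */
size_t utf8_encode32(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                          /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                      /* the case a 16-bit type cannot reach */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, U+20AC encodes as E2 82 AC and U+1F600 as F0 9F 98 80. Only the last branch is new relative to a 16-bit encoder, so the encoder side of the change is small; the tables are where the real work would be.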

c) For that matter, should we just be outsourcing all this
  crap to another library?  Does anyone know a good library
  that can do what Cyrus does (take one character at a time
  and keep state?)

I am not sure about that, but if people know a good library...
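
For anyone evaluating libraries against (c), this is roughly the shape of interface meant by "one character at a time, keeping state": the caller pushes single bytes and the converter remembers where it is between calls. The sketch below is a simplified UTF-8 decoder written for illustration only. struct utf8_state and utf8_push are hypothetical names, not the Cyrus API, and the decoder does not reject overlong forms or surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decoder state carried between calls. Initialise both fields to zero. */
struct utf8_state {
    uint32_t cp;        /* codepoint bits accumulated so far */
    int remaining;      /* continuation bytes still expected */
};

/* Feed one byte to the decoder.
 * Returns 1 and stores a codepoint in *out when a character completes,
 * 0 when more input is needed, and -1 on a malformed byte (state is reset).
 * Simplification: overlong forms and surrogates are not rejected. */
int utf8_push(struct utf8_state *st, unsigned char byte, uint32_t *out)
{
    assert(st != NULL && out != NULL);
    if (st->remaining == 0) {
        if (byte < 0x80) {                     /* ASCII completes immediately */
            *out = byte;
            return 1;
        }
        if ((byte & 0xE0) == 0xC0) { st->cp = byte & 0x1F; st->remaining = 1; return 0; }
        if ((byte & 0xF0) == 0xE0) { st->cp = byte & 0x0F; st->remaining = 2; return 0; }
        if ((byte & 0xF8) == 0xF0) { st->cp = byte & 0x07; st->remaining = 3; return 0; }
        return -1;                             /* stray continuation or invalid lead byte */
    }
    if ((byte & 0xC0) != 0x80) {               /* expected a continuation byte */
        st->remaining = 0;
        return -1;
    }
    st->cp = (st->cp << 6) | (byte & 0x3F);
    if (--st->remaining > 0)
        return 0;
    *out = st->cp;
    return 1;
}
```

Pushing the bytes E2, 82, AC returns 0, 0 and then 1 with U+20AC in *out. Because all context lives in the struct, input can arrive in arbitrary chunks, which is the property a replacement library would need to offer.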
