Bron Gondwana wrote:
I'm in the process of rewriting lib/mkchartable.c
and lib/charset.c, with the eventual goal being a more
flexible charset conversion API that can be used to
make sieve rules match on the decoded values, and
other funky things.
It turns out to be quite a lot of changes. My initial
work in progress is up here:
http://github.com/brong/cyrus-imapd/commit/863b5b51dd27f184fa00de4ec5a6aca3308fc30e
As you can see, it's quite a bit of code.
Anyway - I'd like some feedback on a couple of things:
a) It's going to use a little more CPU this way, because
instead of having a table that converts _directly_ from
the source charset to UTF-8 in search-canonical-form,
it does one conversion to Unicode characters (16-bit),
then another table converts each of those into a stream of zero
to 15 characters (yes, something expands to 15 separate
codepoints, no, I don't want to know what it is!).
Finally, a third pass converts the character codepoints
to UTF-8.
b) Should we make this 32-bit Unicode characters while we're
at it, and extend the UTF-8 converter?
Yes!
And upgrade the tables to Unicode 5.1.0.
And also change the normalization to conform to RFC 5051.
c) For that matter, should we just be outsourcing all this
crap to another library? Does anyone know a good library
that can do what Cyrus does (take one character at a time
and keep state?)
I am not sure about that, but if people know a good library...
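
For reference, the three-pass pipeline from (a), with the 32-bit
codepoints from (b) and the one-character-at-a-time interface from (c),
could look roughly like the sketch below. Everything in it is
hypothetical: the names (struct conv_state, conv_putc, canonicalize,
utf8_encode) are not taken from lib/charset.c or from the commit above,
the source charset is hardwired to ISO-8859-1, and the canonicalization
rules (drop whitespace, uppercase ASCII, expand sharp s to "SS") are toy
stand-ins for the generated tables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_EXPANSION 15   /* longest expansion mentioned in the thread */

/* Hypothetical converter state: fed one byte at a time, output
 * accumulates as a NUL-terminated UTF-8 string. */
struct conv_state {
    char out[256];
    size_t len;
};

static void conv_init(struct conv_state *s)
{
    memset(s, 0, sizeof(*s));
}

/* Pass 1: source charset -> codepoint. ISO-8859-1 is an identity
 * mapping; a real charset needs a table and, for multi-byte
 * encodings, state carried between calls. */
static uint32_t latin1_to_codepoint(unsigned char c)
{
    return c;
}

/* Pass 2: one codepoint -> zero or more canonical codepoints.
 * Toy rules only, standing in for the generated table. */
static size_t canonicalize(uint32_t cp, uint32_t out[MAX_EXPANSION])
{
    if (cp == ' ' || cp == '\t' || cp == '\r' || cp == '\n')
        return 0;                          /* expands to nothing */
    if (cp >= 'a' && cp <= 'z') {
        out[0] = cp - ('a' - 'A');         /* ASCII uppercase */
        return 1;
    }
    if (cp == 0xDF) {                      /* sharp s -> "SS" */
        out[0] = 'S';
        out[1] = 'S';
        return 2;
    }
    out[0] = cp;
    return 1;
}

/* Pass 3: codepoint -> UTF-8, covering the full 32-bit-era range up
 * to U+10FFFF. Returns bytes written, 0 for invalid codepoints. */
static size_t utf8_encode(uint32_t cp, char *buf)
{
    if (cp < 0x80) {
        buf[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        buf[0] = (char)(0xC0 | (cp >> 6));
        buf[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                          /* surrogates are not valid */
    if (cp < 0x10000) {
        buf[0] = (char)(0xE0 | (cp >> 12));
        buf[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        buf[0] = (char)(0xF0 | (cp >> 18));
        buf[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Run one input byte through all three passes. */
static void conv_putc(struct conv_state *s, unsigned char c)
{
    uint32_t canon[MAX_EXPANSION];
    size_t n = canonicalize(latin1_to_codepoint(c), canon);

    for (size_t i = 0; i < n; i++) {
        /* room for up to 4 bytes plus the terminating NUL */
        assert(s->len + 4 < sizeof(s->out));
        s->len += utf8_encode(canon[i], s->out + s->len);
    }
    s->out[s->len] = '\0';
}
```

Feeding the ISO-8859-1 bytes of "Straße 1" to conv_putc one at a time
leaves "STRASSE1" in s.out. The extra CPU cost from (a) is visible
here: every input byte goes through two table-style lookups and an
encoder call instead of a single direct lookup.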