I'm in the process of rewriting lib/mkchartable.c and lib/charset.c, with the eventual goal of a more flexible charset conversion API that can be used to make sieve rules match on the decoded values, and other funky things.

It turns out to be quite a lot of changes. My initial work in progress is up here:

http://github.com/brong/cyrus-imapd/commit/863b5b51dd27f184fa00de4ec5a6aca3308fc30e

As you can see, it's quite a bit of code.

Anyway - I'd like some feedback on a couple of things:

a) It's going to use a little more CPU this way. Instead of having a table that converts _directly_ from the source charset to UTF-8 in search-canonical-form, it does one conversion to Unicode characters (16bit), then another table converts each of those into a stream of zero to 15 characters (yes, something expands to 15 separate codepoints; no, I don't want to know what it is!). Finally, a third pass converts the character codepoints to UTF-8.

b) Should we make this 32bit Unicode characters while we're at it, and extend the UTF-8 converter?

c) For that matter, should we just be outsourcing all this crap to another library? Does anyone know a good library that can do what Cyrus does (take one character at a time and keep state)?

d) Whitespace compression. I'm currently mapping all whitespace to ' ' instead of '', and then either stripping all ' ' from the string, or only outputting a ' ' if the previous character on the output string was not a space. Rob tells me that there are some issues with Asian charsets, where space doesn't have any meaning - how best to handle?

e) Interfaces, interfaces, interfaces. At the moment we have:

 * charset_compilepat - for use in:
   * charset_searchstring
   * charset_searchfile
 * charset_decode_mimebody - and
 * charset_encode_mimebody
 * charset_extractfile

The implementation I'm working on adds "int flags" as an extra parameter to each of these, allowing CHARSET_CANON and CHARSET_STRIPSPACE to be passed down to the translation layer. Would people be happy with that as an interface? It's somewhat invasive, needing changes through lots of imap/*.c and sieve/*.c files.

Bron.
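
To make (a), (d) and (e) a bit more concrete, here is a minimal sketch of the three-stage pipeline with a flags parameter. It is not the code in the commit above. The flag names come from this post, but their values are assumptions, and struct sketch_conv and the sketch_* / canon_cp / emit_utf8 functions are hypothetical names made up for illustration. Stage 1 assumes ISO-8859-1 input (byte value equals codepoint), and stage 2 uses a tiny hard-coded mapping (ASCII lowercasing, whitespace to ' ', U+00DF expanding to "ss") in place of the real generated tables.

```c
#include <stddef.h>
#include <stdint.h>

/* Flag names are from the post; the values are assumptions. */
#define CHARSET_CANON      (1 << 0)
#define CHARSET_STRIPSPACE (1 << 1)

/* Hypothetical converter state: whatever must survive between characters. */
struct sketch_conv {
    int flags;
    int last_was_space;   /* previous emitted character was ' ' */
    char *out;
    size_t len, cap;
};

/* Stage 3: encode one 16-bit codepoint as UTF-8 (1 to 3 bytes). */
static void emit_utf8(struct sketch_conv *c, uint16_t cp)
{
    if (c->len + 4 > c->cap) return;   /* truncate, keeping room for NUL */
    if (cp < 0x80) {
        c->out[c->len++] = (char)cp;
    } else if (cp < 0x800) {
        c->out[c->len++] = (char)(0xC0 | (cp >> 6));
        c->out[c->len++] = (char)(0x80 | (cp & 0x3F));
    } else {
        c->out[c->len++] = (char)(0xE0 | (cp >> 12));
        c->out[c->len++] = (char)(0x80 | ((cp >> 6) & 0x3F));
        c->out[c->len++] = (char)(0x80 | (cp & 0x3F));
    }
    c->out[c->len] = '\0';
}

/* Stage 2: map one codepoint to zero or more canonical codepoints.
 * Illustrative mapping only; the real tables are far larger. */
static void canon_cp(struct sketch_conv *c, uint16_t cp)
{
    if (!(c->flags & CHARSET_CANON)) { emit_utf8(c, cp); return; }

    if (cp == ' ' || cp == '\t' || cp == '\r' || cp == '\n' || cp == 0xA0) {
        /* Drop every space, or collapse a run of whitespace to one ' '. */
        if ((c->flags & CHARSET_STRIPSPACE) || c->last_was_space) return;
        c->last_was_space = 1;
        emit_utf8(c, ' ');
        return;
    }
    c->last_was_space = 0;

    if (cp == 0xDF) {                  /* one codepoint expands to two */
        emit_utf8(c, 's');
        emit_utf8(c, 's');
        return;
    }
    if (cp >= 'A' && cp <= 'Z') cp = (uint16_t)(cp + ('a' - 'A'));
    emit_utf8(c, cp);
}

/* Stage 1: source charset to codepoint, one byte at a time.
 * Assumes ISO-8859-1, where the byte value is the codepoint. */
static void sketch_feed(struct sketch_conv *c, unsigned char byte)
{
    canon_cp(c, (uint16_t)byte);
}

/* Convenience wrapper: convert a NUL-terminated string, return bytes written. */
static size_t sketch_convert(const char *in, int flags, char *out, size_t cap)
{
    struct sketch_conv c = { flags, 0, out, 0, cap };
    if (cap) out[0] = '\0';
    for (; *in; in++) sketch_feed(&c, (unsigned char)*in);
    return c.len;
}
```

With these assumptions, sketch_convert("Hello \t World", CHARSET_CANON, buf, sizeof buf) leaves "hello world" in buf, and adding CHARSET_STRIPSPACE gives "helloworld". Keeping last_was_space in the state struct is what lets the whitespace collapsing work one character at a time, which is the property asked about in (c).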
