Re: How to use non-ascii charsets with sieve?
Hi,

On Mon, 09 Dec 2002 22:17:06 -0500, Lawrence Greenfield [EMAIL PROTECTED] wrote...

--On Tuesday, December 10, 2002 11:52 AM +0900 Mark Keasling [EMAIL PROTECTED] wrote:

Hi Larry, [ ... decode in fill_cache() ... ] This hasn't been tested yet since I stuck it in yesterday before going home and have just returned to the office. It should decode subjects into UTF-8. But it may have interesting unintended side effects. So far we are only interested in decoded subjects. But decoding the comment part of addresses also has a high probability of being desired. Depends on the feedback we get from users. Will charset_decode1522() strip the whitespace?

Yes, the output of charset_decode1522() is intended to be fed into the Cyrus IMAP SEARCH algorithm, which ignores whitespace. It also does case folding, preventing i;octet searches from working. charset_decode1522() would work if it were using a different transcoding table than the one generated in the lib/ directory.

I'm in the process of trying to figure out how this stuff works... Is it possible to separate the charset-to-UTF-8 conversion from the text-to-search-data transformation?

Regards,
Mark Keasling [EMAIL PROTECTED]
Re: How to use non-ascii charsets with sieve?
I dug up the patch I have for creating a separate Sieve charset table. I have no idea if it will still apply cleanly due to its age, but it should point you at the places to look in the code. If you can find a way to make one unified table as Larry suggests, that would be great.

Mark Keasling wrote: [...]

--
Kenneth Murchison     Oceana Matrix Ltd.
Software Engineer     21 Princeton Place
716-662-8973 x26      Orchard Park, NY 14127
--PGP Public Key--    http://www.oceana.com/~ken/ksm.pgp

sieve-mime.patch.gz
Description: GNU Zip compressed data
Re: How to use non-ascii charsets with sieve?
Date: Tue, 10 Dec 2002 19:07:55 +0900 (JST)
From: Mark Keasling [EMAIL PROTECTED]

[...] I'm in the process of trying to figure out how this stuff works... Is it possible to separate the charset to utf-8 conversion from the text to search data transformation?

It would be technically possible. It's probably not the easiest thing to do in the Cyrus code base.

Currently mkchartable.c does casemapping, character decomposition, and whitespace elimination. It also applies some mappings (charset/unifix.txt) that help with a language-independent match but may not be appropriate for collation or for all UTF-8 comparators.

To make the chartable stuff work for both Sieve and our current SEARCH, we probably should build tables that just output decomposed (or fully composed) UTF-8 characters. We can then write a UTF-8 comparator library that, during comparison, does the canonicalization.

The easier path to make Sieve work would be to just build two completely separate tables. I'd prefer to see the more general solution.

While none of this is rocket science, it is heavily detail-oriented and requires concentration.

Larry
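The comparator-library idea above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Cyrus code: it assumes both strings have already been transcoded to UTF-8 by the charset tables, and the names cmp_octet, cmp_ascii_casemap and ascii_fold are hypothetical. It covers the two comparators named in the thread, i;octet and i;ascii-casemap, and does the case folding during comparison instead of baking it into the table.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* i;octet: compare the UTF-8 bytes exactly as they are.
 * Whitespace and case are both significant. */
int cmp_octet(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b);
}

/* Fold US-ASCII lowercase letters to uppercase. Bytes >= 0x80, which
 * make up multibyte UTF-8 sequences, are returned unchanged. */
unsigned char ascii_fold(unsigned char c)
{
    return (c >= 'a' && c <= 'z') ? (unsigned char)(c - 'a' + 'A') : c;
}

/* i;ascii-casemap: like i;octet, except that US-ASCII letters compare
 * without regard to case. Whitespace stays significant. */
int cmp_ascii_casemap(const char *a, const char *b)
{
    const unsigned char *p = (const unsigned char *)a;
    const unsigned char *q = (const unsigned char *)b;

    assert(a != NULL && b != NULL);
    while (*p != '\0' && ascii_fold(*p) == ascii_fold(*q)) {
        p++;
        q++;
    }
    return (int)ascii_fold(*p) - (int)ascii_fold(*q);
}
```

Both functions return zero on a match, in the style of strcmp(). Because the table output is left untouched, the same decoded string could be handed to either comparator, or to a SEARCH-specific one that additionally drops whitespace.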
How to use non-ascii charsets with sieve?
Hi,

Can someone give me a "how to" pointer? I need to know how to use non-ASCII text in Sieve scripts, for example Japanese in message headers or mailbox names.

For example, a message has a subject as follows:

    題名: アクセシビリティセミナー報告

It is MIME-encoded as:

    Subject: =?ISO-2022-JP?B?GyRCJSIlLyU7JTclUyVqJUYlIyU7JV8lSiE8SnM5cBsoQg==?=

<script language="sieve" version="RFC-3028">
# pretend this is encoded in UTF-8
require ["reject","fileinto"];
if header :contains "Subject" "セミナー報告" {
    fileinto "セミナー報告" ;
}
</script>

I don't know how to make timsieved decode MIME headers or MUTF-7 encode mailbox names.

Regards,
Mark Keasling [EMAIL PROTECTED]
Re: How to use non-ascii charsets with sieve?
Lawrence Greenfield wrote:

You bring up good questions.

First, our Sieve implementation currently doesn't deal with RFC 2047 encoded headers---or rather, it just compares the undecoded headers against the UTF-8 string. This is obviously a bug which sadly isn't in bugzilla.

Ken and I talked (a long time ago) about this. The main issue is that Cyrus's character comparison routines remove whitespace and always perform casemapping, and this is probably inappropriate for Sieve's use. Fixing this is probably not difficult, but I'd prefer not to have multiple different canonicalization tables.

I _think_ I still have the code around which implements the Sieve charset tables and does the rfc2047 decoding. I don't recall why we had to have the separate tables however.

--
Kenneth Murchison     Oceana Matrix Ltd.
Software Engineer     21 Princeton Place
716-662-8973 x26      Orchard Park, NY 14127
--PGP Public Key--    http://www.oceana.com/~ken/ksm.pgp
Re: How to use non-ascii charsets with sieve?
First, our Sieve implementation currently doesn't deal with RFC 2047 encoded headers---or rather, it just compares the undecoded headers against the UTF-8 string. This is obviously a bug which sadly isn't in bugzilla.

Ken and I talked (a long time ago) about this. The main issue is that Cyrus's character comparison routines remove whitespace and always perform casemapping, and this is probably inappropriate for Sieve's use. Fixing this is probably not difficult, but I'd prefer not to have multiple different canonicalization tables.

I _think_ I still have the code around which implements the Sieve charset tables and does the rfc2047 decoding. I don't recall why we had to have the separate tables however.

Different comparators would require different tables, I think. The table Cyrus usually uses isn't suitable for i;ascii-casemap, since it treats whitespace as insignificant. But transcoding to UTF-8 and doing a dumb comparison is all that's required; that would be a big improvement on what Cyrus is doing now, and it is not hard to implement.

Tim
Re: How to use non-ascii charsets with sieve?
Hi Larry,

We are considering a modification like this to fill_cache(message_data_t *) in cyrus-imapd-2.1.11/sieve/test.c

%%SNIP%%
void fill_cache(message_data_t *m)
{
    rewind(m->data);

    /* let's fill that header cache */
    for (;;) {
        char *name, *body;
        int cl, clinit;

        if (parseheader(m->data, &name, &body) < 0) {
            break;
        }

#ifdef DECODE_SUBJECT
        /* decode mime encoded subjects */
        if( name && *name && ! strcmp( name, "subject" ) &&
            body && *body && strstr( body, "=?" )) {
            char * decoded = charset_decode1522( body, NULL, 0 ) ;
            /* swap in the decoded text only if decoding produced something */
            if( decoded && *decoded ) {
                free( body ) ;
                body = decoded ;
            }
        }
#endif /* DECODE_SUBJECT */
%%SNIP%%

This hasn't been tested yet since I stuck it in yesterday before going home and have just returned to the office. It should decode subjects into UTF-8. But it may have "interesting" unintended side effects. So far we are only interested in decoded subjects. But decoding the comment part of addresses also has a high probability of being desired. Depends on the feedback we get from users.

Will charset_decode1522() strip the whitespace? Someone else found the function and I have only given it the most cursory glance.

On Mon, 9 Dec 2002 15:59:38 -0500, Lawrence Greenfield [EMAIL PROTECTED] wrote...

You bring up good questions.

First, our Sieve implementation currently doesn't deal with RFC 2047 encoded headers---or rather, it just compares the undecoded headers against the UTF-8 string. This is obviously a bug which sadly isn't in bugzilla.

Ken and I talked (a long time ago) about this. The main issue is that Cyrus's character comparison routines remove whitespace and always perform casemapping, and this is probably inappropriate for Sieve's use. Fixing this is probably not difficult, but I'd prefer not to have multiple different canonicalization tables.

The "fileinto" problem is more straightforward and should be fixed in lmtpd.c:sieve_fileinto(). I would add a function to mboxname.[ch] of mboxname_utf8tomutf7() and then make sieve_fileinto() call it.

Larry

Thank You! This is very NICE as I hadn't gotten far enough along to look at this yet. The obvious workaround (ugly hack) for fileinto is to have the client do the MUTF-7 conversion before submitting the script. We're working on the client, so such a hack isn't out of the question, but it probably won't work well if some other client were to access the server.

Date: Mon, 9 Dec 2002 19:53:37 +0900 (JST)
From: Mark Keasling [EMAIL PROTECTED]

[...]

<script language="sieve" version="RFC-3028">
# pretend this is encoded in UTF-8
require ["reject","fileinto"];
if header :contains "Subject" "セミナー報告" {
    fileinto "セミナー報告" ;
}
</script>

I don't know how to make timsieved decode MIME headers or MUTF-7 encode mailbox names.

Regards,
Mark Keasling [EMAIL PROTECTED]
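For the fileinto side, the conversion Larry proposes could look roughly like the sketch below. It is not the Cyrus implementation: the thread only names a function mboxname_utf8tomutf7(), so the name utf8_to_mutf7, its signature and the helpers here are all hypothetical. The encoding rules are those of IMAP modified UTF-7: printable US-ASCII stands for itself, "&" becomes "&-", and everything else is written as UTF-16 in base64 (with "," in place of "/" and no padding) between "&" and "-". UTF-8 validation is deliberately minimal.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Modified base64 alphabet: ',' replaces '/', and no '=' padding is used. */
static const char MB64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+,";

struct enc {
    char *out;      /* output buffer */
    size_t n;       /* bytes written so far */
    uint32_t bits;  /* UTF-16 bits not yet emitted */
    int nbits;      /* valid low bits in 'bits' (0..5 between calls) */
    int in_b64;     /* nonzero while inside an "&...-" run */
};

/* Decode one UTF-8 sequence, advancing *s. Returns -1 on malformed input.
 * Overlong forms and surrogates are NOT rejected (sketch only). */
static long utf8_next(const unsigned char **s)
{
    const unsigned char *p = *s;
    long cp;
    int n;

    if (p[0] < 0x80) { cp = p[0]; n = 0; }
    else if ((p[0] & 0xE0) == 0xC0) { cp = p[0] & 0x1F; n = 1; }
    else if ((p[0] & 0xF0) == 0xE0) { cp = p[0] & 0x0F; n = 2; }
    else if ((p[0] & 0xF8) == 0xF0) { cp = p[0] & 0x07; n = 3; }
    else return -1;
    p++;
    while (n-- > 0) {
        if ((*p & 0xC0) != 0x80) return -1;
        cp = (cp << 6) | (*p++ & 0x3F);
    }
    *s = p;
    return cp;
}

/* Append one UTF-16 code unit to the current base64 run. */
static void put_unit(struct enc *e, uint32_t unit)
{
    e->bits = ((e->bits & ((1u << e->nbits) - 1)) << 16) | unit;
    e->nbits += 16;
    while (e->nbits >= 6) {
        e->nbits -= 6;
        e->out[e->n++] = MB64[(e->bits >> e->nbits) & 0x3F];
    }
}

/* Close an open base64 run, zero-padding the last sextet. */
static void end_run(struct enc *e)
{
    if (!e->in_b64) return;
    if (e->nbits > 0)
        e->out[e->n++] = MB64[(e->bits << (6 - e->nbits)) & 0x3F];
    e->out[e->n++] = '-';
    e->bits = 0;
    e->nbits = 0;
    e->in_b64 = 0;
}

/* Returns a malloc()ed modified UTF-7 string, or NULL on bad input. */
char *utf8_to_mutf7(const char *utf8)
{
    const unsigned char *s = (const unsigned char *)utf8;
    struct enc e = { NULL, 0, 0, 0, 0 };

    assert(utf8 != NULL);
    /* Worst case is a lone one-byte control character: '&' + 3 + '-'. */
    e.out = malloc(strlen(utf8) * 5 + 1);
    if (e.out == NULL) return NULL;

    while (*s != '\0') {
        long cp = utf8_next(&s);
        if (cp < 0 || cp > 0x10FFFF) { free(e.out); return NULL; }
        if (cp >= 0x20 && cp <= 0x7e) {
            end_run(&e);
            e.out[e.n++] = (char)cp;
            if (cp == '&') e.out[e.n++] = '-';
        } else {
            if (!e.in_b64) { e.out[e.n++] = '&'; e.in_b64 = 1; }
            if (cp >= 0x10000) {            /* needs a surrogate pair */
                cp -= 0x10000;
                put_unit(&e, 0xD800u | (uint32_t)(cp >> 10));
                put_unit(&e, 0xDC00u | (uint32_t)(cp & 0x3FF));
            } else {
                put_unit(&e, (uint32_t)cp);
            }
        }
    }
    end_run(&e);
    e.out[e.n] = '\0';
    return e.out;
}
```

With this in place, sieve_fileinto() would pass the UTF-8 mailbox name from the script through the converter before looking up the mailbox, so clients would not have to pre-encode names themselves. The caller owns the returned buffer and must free() it.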
Re: How to use non-ascii charsets with sieve?
--On Monday, December 09, 2002 6:01 PM -0800 Tim Showalter [EMAIL PROTECTED] wrote:

Different comparators would require different tables, I think. The table Cyrus usually uses isn't suitable for i;ascii-casemap, since it treats whitespace as insignificant. But transcoding to UTF-8 and doing a dumb comparison is all that's required; that would be a big improvement on what Cyrus is doing now, and it is not hard to implement.

Yes, unlike the IMAP SEARCH command, the Sieve comparators have strict semantics.

I believe it's possible to implement the Sieve comparators and Cyrus's current SEARCH comparator with a single table. The table would transcode into UTF-8 (fully decomposed) and the SEARCH comparator could do the modifications needed.

Larry
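A rough sketch of the SEARCH half of that split, assuming a single table has already produced plain UTF-8: the SEARCH comparator applies its own looser rules (drop whitespace, ignore case) at match time, so Sieve can keep using the unmodified text. The names search_canon and search_match are hypothetical, and only US-ASCII whitespace and letters are handled here; the decomposition and unifix mappings discussed earlier in the thread are left out.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return a malloc()ed copy of s with ASCII whitespace removed and ASCII
 * letters uppercased. Bytes >= 0x80 (UTF-8 multibyte sequences) are
 * copied unchanged. Returns NULL if allocation fails. */
char *search_canon(const char *s)
{
    char *out, *w;

    assert(s != NULL);
    out = w = malloc(strlen(s) + 1);
    if (out == NULL) return NULL;
    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char)*s;
        if (c == ' ' || c == '\t' || c == '\r' || c == '\n')
            continue;                       /* whitespace is not significant */
        *w++ = (char)((c >= 'a' && c <= 'z') ? c - 'a' + 'A' : c);
    }
    *w = '\0';
    return out;
}

/* SEARCH-style substring match: canonicalize both sides, then compare.
 * Returns 1 on a match, 0 otherwise (including on allocation failure). */
int search_match(const char *text, const char *pattern)
{
    char *t = search_canon(text);
    char *p = search_canon(pattern);
    int hit = (t != NULL && p != NULL) ? (strstr(t, p) != NULL) : 0;

    free(t);
    free(p);
    return hit;
}
```

The point of the design is that the canonicalization lives in the comparator, not in the transcoding table, so a strict comparator and a loose one can share the same decoded input.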