Re: How to use non-ascii charsets with sieve?

2002-12-10 Thread Mark Keasling
Hi,


On Mon, 09 Dec 2002 22:17:06 -0500, Lawrence Greenfield [EMAIL PROTECTED] wrote...
 --On Tuesday, December 10, 2002 11:52 AM +0900 Mark Keasling 
 [EMAIL PROTECTED] wrote:
 
  Hi Larry,
 
 [ ... decode in fill_cache() ... ]
  This hasn't been tested yet since I stuck it in yesterday before
  going home and have just returned to the office.  It should decode
  subjects into utf8.  But it may have interesting unintended
  side-effects.  So far we are only interested in decoded subjects.  But
  decoding the comment part of addresses also has a high probability of
  being desired.  Depends on the feed-back we get from users.
 
  Will charset_decode1522( ) strip the whitespace?
 
 Yes, the output of charset_decode1522() is intended to be fed into the 
 Cyrus IMAP SEARCH algorithm, which ignores whitespace. It also does case 
 folding, preventing i;octet searches from working.
 
 charset_decode1522() would work if it was using a different transcoding 
 table than what is generated in the lib/ directory.

I'm in the process of trying to figure out how this stuff works...
Is it possible to separate the charset to utf-8 conversion from the text to
search data transformation?

Regards,
Mark Keasling [EMAIL PROTECTED]




Re: How to use non-ascii charsets with sieve?

2002-12-10 Thread Ken Murchison
I dug up the patch I have for creating a separate Sieve charset table. 
I have no idea if it will still apply cleanly due to its age, but it
should point you at the places to look in the code.  If you can find a
way to make one unified table as Larry suggests, that would be great.



-- 
Kenneth Murchison Oceana Matrix Ltd.
Software Engineer 21 Princeton Place
716-662-8973 x26  Orchard Park, NY 14127
--PGP Public Key--http://www.oceana.com/~ken/ksm.pgp


sieve-mime.patch.gz
Description: GNU Zip compressed data


Re: How to use non-ascii charsets with sieve?

2002-12-10 Thread Lawrence Greenfield
   Date: Tue, 10 Dec 2002 19:07:55 +0900 (JST)
   From: Mark Keasling [EMAIL PROTECTED]
[...]
   I'm in the process of trying to figure out how this stuff works...
   Is it possible to separate the charset to utf-8 conversion from the text to
   search data transformation?

It would be technically possible. It's probably not the easiest thing
to do in the Cyrus code base.

Currently mkchartable.c does casemapping, character decomposition, and
whitespace elimination. It also applies some mappings
(charset/unifix.txt) that help with a language-independent match but
may not be appropriate for collation or all UTF-8 comparators.

To make the chartable stuff work for Sieve and our current SEARCH, we
probably should build tables that just output decomposed (or fully
composed) UTF-8 characters.

We can then write a UTF-8 comparator library that, during comparison,
does the canonicalization.

The easier path to make Sieve work would be to just build two
completely separate tables. I'd prefer to see the more general
solution.

While none of this is rocket science, it is heavily detail-oriented
and requires concentration.

Larry
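
A minimal sketch of the "canonicalize during comparison" idea, assuming
both strings have already been transcoded to plain UTF-8. The name
search_style_equal() is hypothetical and is not part of Cyrus. It only
ignores ASCII whitespace and ASCII case; a real comparator library would
also have to handle decomposition and non-ASCII case mapping.

```c
/* Hypothetical helper: skip ASCII whitespace. */
static const unsigned char *skip_ws(const unsigned char *p)
{
    while (*p == ' ' || *p == '\t' || *p == '\r' || *p == '\n')
        p++;
    return p;
}

/* Hypothetical helper: fold ASCII upper case to lower case only. */
static unsigned char fold_ascii(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c - 'A' + 'a') : c;
}

/*
 * Compare two UTF-8 strings loosely, in the spirit of the SEARCH
 * behaviour described above: whitespace is ignored and ASCII case is
 * folded while comparing, so the stored data stays untouched.
 * Bytes >= 0x80 are compared literally (assumption: no decomposition).
 * Returns 1 when the strings match, 0 otherwise.
 */
int search_style_equal(const char *a, const char *b)
{
    const unsigned char *p = (const unsigned char *)a;
    const unsigned char *q = (const unsigned char *)b;

    for (;;) {
        p = skip_ws(p);
        q = skip_ws(q);
        if (*p == '\0' || *q == '\0')
            return *p == *q;        /* equal only if both ended */
        if (fold_ascii(*p) != fold_ascii(*q))
            return 0;
        p++;
        q++;
    }
}
```

With this split, a strict Sieve comparator could read the same UTF-8
data and simply skip the folding steps.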





How to use non-ascii charsets with sieve?

2002-12-09 Thread Mark Keasling
Hi,

Can someone give me a "how to" pointer...

I need to know how to use non-ASCII text in sieve scripts.
For example: using Japanese in message headers or mailbox names.

For example a message has a subject as follows:

題名: アクセシビリティセミナー報告

it is MIME-encoded as:

Subject: =?ISO-2022-JP?B?GyRCJSIlLyU7JTclUyVqJUYlIyU7JV8lSiE8SnM5cBsoQg==?=


<script language="sieve" version="RFC-3028">
  # pretend this is encoded in UTF-8

  require ["reject","fileinto"];

  if header :contains "Subject" "セミナー報告"
  {
    fileinto "セミナー報告" ;
  }
</script>

I don't know how to make timsieved decode mime headers or
MUTF-7 encode mailbox names.

Regards,
Mark Keasling [EMAIL PROTECTED]
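
For illustration, a minimal sketch of the first step in handling a
Subject like the one above: splitting an RFC 2047 encoded-word and
undoing its "B" (base64) encoding. The name decode_b_word() is
hypothetical and is not a Cyrus function. The sketch assumes a single
encoded-word, ignores the "Q" encoding, and does not convert the
resulting bytes from the named charset to UTF-8; that step would still
need a charset table or a converter.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Map one base64 character to its 6-bit value, or -1 if invalid. */
static int b64val(int c)
{
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == '/') return 63;
    return -1;
}

/*
 * Hypothetical helper: parse "=?charset?B?payload?=".
 * Copies the charset name into charset[] and the raw decoded bytes
 * into out[]. Returns the number of decoded bytes, or -1 on error.
 */
long decode_b_word(const char *word, char *charset, size_t cslen,
                   unsigned char *out, size_t outlen)
{
    const char *cs_end, *payload, *end, *p;
    size_t n, o = 0;
    uint32_t acc = 0;
    int nbits = 0;

    if (strncmp(word, "=?", 2) != 0) return -1;
    cs_end = strchr(word + 2, '?');
    if (!cs_end) return -1;
    if ((cs_end[1] != 'B' && cs_end[1] != 'b') || cs_end[2] != '?')
        return -1;
    payload = cs_end + 3;
    end = strstr(payload, "?=");
    if (!end) return -1;

    n = (size_t)(cs_end - (word + 2));
    if (n + 1 > cslen) return -1;
    memcpy(charset, word + 2, n);
    charset[n] = '\0';

    /* Collect 6 bits per character, emit a byte whenever 8 are ready. */
    for (p = payload; p < end && *p != '='; p++) {
        int v = b64val((unsigned char)*p);
        if (v < 0) return -1;
        acc = (acc << 6) | (uint32_t)v;
        nbits += 6;
        if (nbits >= 8) {
            nbits -= 8;
            if (o >= outlen) return -1;
            out[o++] = (unsigned char)((acc >> nbits) & 0xFF);
        }
    }
    return (long)o;
}
```

Applied to the Subject above, the charset comes back as "ISO-2022-JP"
and the decoded bytes start with the escape sequence ESC $ B, i.e. they
are still ISO-2022-JP and not yet comparable with a UTF-8 script string.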



Re: How to use non-ascii charsets with sieve?

2002-12-09 Thread Ken Murchison


Lawrence Greenfield wrote:
 
 You bring up good questions.
 
 First, our Sieve implementation currently doesn't deal with RFC 2047
 encoded headers---or rather, it just compares the undecoded headers
 against the UTF-8 string. This is obviously a bug which sadly isn't in
 bugzilla.
 
 Ken and I talked (a long time ago) about this. The main issue is that
 Cyrus's character comparison routines remove whitespace and always
 perform casemapping, and this is probably inappropriate for Sieve's
 use. Fixing this is probably not difficult, but I'd prefer not to have
 multiple different canonicalization tables.

I _think_ I still have the code around which implements the Sieve
charset tables and does the rfc2047 decoding.  I don't recall why we had
to have the separate tables however.

-- 
Kenneth Murchison Oceana Matrix Ltd.
Software Engineer 21 Princeton Place
716-662-8973 x26  Orchard Park, NY 14127
--PGP Public Key--http://www.oceana.com/~ken/ksm.pgp



Re: How to use non-ascii charsets with sieve?

2002-12-09 Thread Tim Showalter
  First, our Sieve implementation currently doesn't deal with RFC 2047
  encoded headers---or rather, it just compares the undecoded headers
  against the UTF-8 string. This is obviously a bug which sadly isn't in
  bugzilla.
 
  Ken and I talked (a long time ago) about this. The main issue is that
  Cyrus's character comparison routines remove whitespace and always
  perform casemapping, and this is probably inappropriate for Sieve's
  use. Fixing this is probably not difficult, but I'd prefer not to have
  multiple different canonicalization tables.
 
 
 I _think_ I still have the code around which implements the Sieve
 charset tables and does the rfc2047 decoding.  I don't recall why we had
 to have the separate tables however.


Different comparators would require different tables, I think.  The
table Cyrus usually uses isn't suitable for i;ascii-casemap, since it
treats whitespace as insignificant.  But transcoding to UTF-8 and doing
a dumb comparison is all that's required.  That would be a big
improvement on what Cyrus is doing now, and not hard to implement.

Tim
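
A minimal sketch of what such "dumb" comparisons could look like once
the header is plain UTF-8. The function names are hypothetical, not
Cyrus or Sieve library calls. i;octet compares bytes exactly, and
i;ascii-casemap folds only the US-ASCII letters, so whitespace stays
significant in both.

```c
#include <stddef.h>
#include <string.h>

/* i;ascii-casemap folds only a-z; all other bytes are left alone. */
static unsigned char ascii_upper(unsigned char c)
{
    return (c >= 'a' && c <= 'z') ? (unsigned char)(c - 'a' + 'A') : c;
}

/* i;octet with :is, a plain byte-for-byte match. */
int octet_is(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}

/* i;octet with :contains, a plain byte substring search. */
int octet_contains(const char *text, const char *key)
{
    return strstr(text, key) != NULL;
}

/* i;ascii-casemap with :is. */
int casemap_is(const char *a, const char *b)
{
    const unsigned char *p = (const unsigned char *)a;
    const unsigned char *q = (const unsigned char *)b;

    while (*p && ascii_upper(*p) == ascii_upper(*q)) {
        p++;
        q++;
    }
    return ascii_upper(*p) == ascii_upper(*q);
}

/* i;ascii-casemap with :contains, using a naive substring scan. */
int casemap_contains(const char *text, const char *key)
{
    const unsigned char *t = (const unsigned char *)text;
    const unsigned char *k = (const unsigned char *)key;
    size_t tl = strlen(text), kl = strlen(key), i, j;

    if (kl > tl)
        return 0;
    for (i = 0; i + kl <= tl; i++) {
        for (j = 0; j < kl; j++)
            if (ascii_upper(t[i + j]) != ascii_upper(k[j]))
                break;
        if (j == kl)
            return 1;
    }
    return 0;
}
```

Because UTF-8 multi-byte sequences never contain ASCII bytes, folding
a-z in place cannot corrupt non-ASCII text such as a Japanese subject.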



Re: How to use non-ascii charsets with sieve?

2002-12-09 Thread Mark Keasling
Hi Larry,

We are considering a modification like this to fill_cache(message_data_t *)
in cyrus-imapd-2.1.11/sieve/test.c

%%SNIP%%
void fill_cache(message_data_t *m)
{
    rewind(m->data);

    /* let's fill that header cache */
    for (;;) {
        char *name, *body;
        int cl, clinit;

        if (parseheader(m->data, &name, &body) < 0) {
            break;
        }

#ifdef DECODE_SUBJECT
        /* decode mime encoded subjects */
        if( name && * name && ! strcmp( name, "subject" )
            && body && * body && strstr( body, "=?" ))
        {
            char * decoded = charset_decode1522( body, NULL, 0 ) ;
            if( decoded && * decoded )
            {
                /* replace the raw header body with the decoded copy */
                free( body ) ;
                body = decoded ;
            }
        }
#endif /* DECODE_SUBJECT */

%%SNIP%%

This hasn't been tested yet since I stuck it in yesterday before
going home and have just returned to the office.  It should decode subjects
into utf8.  But it may have "interesting" unintended side-effects.  So far
we are only interested in decoded subjects.  But decoding the comment part
of addresses also has a high probability of being desired.  Depends on the
feed-back we get from users.

Will charset_decode1522( ) strip the whitespace?
Someone else found the function and I have only given it the most cursory
glance over.

On Mon, 9 Dec 2002 15:59:38 -0500, Lawrence Greenfield [EMAIL PROTECTED] wrote...
 You bring up good questions.
 
 First, our Sieve implementation currently doesn't deal with RFC 2047
 encoded headers---or rather, it just compares the undecoded headers
 against the UTF-8 string. This is obviously a bug which sadly isn't in
 bugzilla.
 
 Ken and I talked (a long time ago) about this. The main issue is that
 Cyrus's character comparison routines remove whitespace and always
 perform casemapping, and this is probably inappropriate for Sieve's
 use. Fixing this is probably not difficult, but I'd prefer not to have
 multiple different canonicalization tables.
 
 The "fileinto" problem is more straightforward and should be fixed in
 lmtpd.c:sieve_fileinto().
 
 I would add a function to mboxname.[ch] of mboxname_utf8tomutf7() and
 then make sieve_fileinto() call it.
 
 Larry

Thank You!  This is very NICE as I hadn't gotten far enough along to look
at this yet.  The obvious workaround (ugly hack) for fileinto is to have
the client do the mutf-7 conversion before submitting the script.  We're
working on the client so such a hack isn't out of the question; but it
probably won't work well if some other client were to access the server.
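
For reference, a minimal sketch of the UTF-8 to modified UTF-7
conversion that either the client or the server would have to perform,
following the mailbox naming rules of RFC 3501 section 5.1.3. The name
utf8_to_mutf7() is hypothetical; it is not the mboxname_utf8tomutf7()
function proposed above, which does not exist yet. The sketch does not
reject overlong UTF-8 forms and knows nothing about hierarchy
separators.

```c
#include <stddef.h>
#include <stdint.h>

static const char B64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+,";

/* Decode one UTF-8 sequence; returns bytes consumed, 0 if malformed. */
static size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    if ((s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) {
        *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0 && (s[1] & 0xC0) == 0x80 &&
        (s[2] & 0xC0) == 0x80) {
        *cp = ((uint32_t)(s[0] & 0x0F) << 12) |
              ((uint32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        return 3;
    }
    if ((s[0] & 0xF8) == 0xF0 && (s[1] & 0xC0) == 0x80 &&
        (s[2] & 0xC0) == 0x80 && (s[3] & 0xC0) == 0x80) {
        *cp = ((uint32_t)(s[0] & 0x07) << 18) |
              ((uint32_t)(s[1] & 0x3F) << 12) |
              ((uint32_t)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
        return 4;
    }
    return 0;
}

/* Returns 0 on success, -1 on malformed input or a too-small buffer. */
int utf8_to_mutf7(const char *in, char *out, size_t outlen)
{
    const unsigned char *s = (const unsigned char *)in;
    size_t o = 0;
    uint32_t bits = 0;      /* bit buffer; the low nbits are pending */
    int nbits = 0, shifted = 0, i;

    if (outlen == 0) return -1;
/* Append one char, always leaving room for the trailing NUL. */
#define PUT(c) do { if (o + 1 >= outlen) return -1; out[o++] = (char)(c); } while (0)
    while (*s) {
        uint32_t cp;
        size_t n = utf8_decode(s, &cp);
        if (n == 0) return -1;
        s += n;
        if (cp >= 0x20 && cp <= 0x7E) {
            if (shifted) {          /* close the base64 run */
                if (nbits > 0) PUT(B64[(bits << (6 - nbits)) & 0x3F]);
                PUT('-');
                shifted = 0; bits = 0; nbits = 0;
            }
            PUT(cp);
            if (cp == '&') PUT('-');    /* literal "&" becomes "&-" */
        } else {
            uint16_t u[2];
            int nu = 0;
            if (cp >= 0x10000) {        /* UTF-16 surrogate pair */
                cp -= 0x10000;
                u[nu++] = (uint16_t)(0xD800 | (cp >> 10));
                u[nu++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
            } else {
                u[nu++] = (uint16_t)cp;
            }
            if (!shifted) { PUT('&'); shifted = 1; }
            for (i = 0; i < nu; i++) {
                bits = (bits << 16) | u[i];
                nbits += 16;
                while (nbits >= 6) {
                    nbits -= 6;
                    PUT(B64[(bits >> nbits) & 0x3F]);
                }
            }
        }
    }
    if (shifted) {
        if (nbits > 0) PUT(B64[(bits << (6 - nbits)) & 0x3F]);
        PUT('-');
    }
#undef PUT
    out[o] = '\0';
    return 0;
}
```

With the example from RFC 3501, the mailbox name "日本語" converts to
"&ZeVnLIqe-", and a literal "&" in a name becomes "&-".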

 
Date: Mon, 9 Dec 2002 19:53:37 +0900 (JST)
From: Mark Keasling [EMAIL PROTECTED]
 [...]
<script language="sieve" version="RFC-3028">
  # pretend this is encoded in UTF-8
 
  require ["reject","fileinto"];
 
  if header :contains "Subject" "セミナー報告"
  {
    fileinto "セミナー報告" ;
  }
</script>
 
I don't know how to make timsieved decode mime headers or
MUTF-7 encode mailbox names.

Regards,
Mark Keasling [EMAIL PROTECTED]



Re: How to use non-ascii charsets with sieve?

2002-12-09 Thread Lawrence Greenfield
--On Monday, December 09, 2002 6:01 PM -0800 Tim Showalter [EMAIL PROTECTED] 
wrote:

different comparators would require different tables, I think.  The table
Cyrus usually uses isn't suitable for i;ascii-casemap since space isn't
significant, but transcoding to UTF-8 and doing a dumb comparison is all
that's required, a big improvement on what Cyrus is doing now, and not
hard to implement.


Yes, unlike the IMAP SEARCH command the Sieve comparators have strict 
semantics. I believe it's possible to implement the Sieve comparators and 
Cyrus's current SEARCH comparator with a single table. The table would 
transcode into UTF-8 (fully decomposed) and the SEARCH comparator could do 
the modifications needed.

Larry