Re: [Dovecot] search and UTF-8 normalization forms (NFD)

2013-06-08 Thread Timo Sirainen
On 21.5.2013, at 14.41, Lutz Preßler lutz.press...@sernet.de wrote:

 On Mi, 15 Mai 2013, Timo Sirainen wrote:
 
 On 11.5.2013, at 18.13, Florian Zeitz florob at babelmonkeys.de wrote:
 So... I had a look at this. Turns out that the current implementation of
 Unicode decomposition (Step 2(b) in i;unicode-casemap) in Dovecot is
 broken. It only handles decomposition properties that include a tag.
 I've attached a hg export that fixes this.
 
 Thanks, added to v2.1 and v2.2 hg.
 
 Thanks, but there still seems to be a problem left. Sender search
 yields all Krüger mails without fts_lucene. But with fts_lucene
 enabled - and files in lucene-indexes/ existing - it does not.
 (If I delete the lucene-index files and search for sender, the
 result is correct - but only until they are recreated.)

Fixed finally: http://hg.dovecot.org/dovecot-2.2/rev/7e54af474ea4

Add the plugin { fts_lucene = normalize no_snowball } setting (NOTE: this change
causes all existing Lucene indexes to be rebuilt).

This fts-lucene is getting rather annoying. I wonder if all of this is somehow 
magically solved in Solr.



Re: [Dovecot] search and UTF-8 normalization forms (NFD)

2013-05-21 Thread Lutz Preßler
On Mi, 15 Mai 2013, Timo Sirainen wrote:

 On 11.5.2013, at 18.13, Florian Zeitz florob at babelmonkeys.de wrote:
  So... I had a look at this. Turns out that the current implementation of
  Unicode decomposition (Step 2(b) in i;unicode-casemap) in Dovecot is
  broken. It only handles decomposition properties that include a tag.
  I've attached a hg export that fixes this.
 
 Thanks, added to v2.1 and v2.2 hg.
 
Thanks, but there still seems to be a problem left. Sender search
yields all Krüger mails without fts_lucene. But with fts_lucene
enabled - and files in lucene-indexes/ existing - it does not.
(If I delete the lucene-index files and search for sender, the
result is correct - but only until they are recreated.)

Lutz 


Re: [Dovecot] search and UTF-8 normalization forms (NFD)

2013-05-15 Thread Timo Sirainen
On 11.5.2013, at 18.13, Florian Zeitz flo...@babelmonkeys.de wrote:

 Am 10.05.2013 15:24, schrieb Florian Zeitz:
 Could you elaborate a bit why you think i;unicode-casemap does not
 handle this case?
 
 Is it only applied to the query, but not the header, or vice versa?
 It seems to me that Step 2 should map both inputs to LATIN CAPITAL
 LETTER U + COMBINING DIAERESIS.
 
 Regards,
 Florian
 
 
 So... I had a look at this. Turns out that the current implementation of
 Unicode decomposition (Step 2(b) in i;unicode-casemap) in Dovecot is
 broken. It only handles decomposition properties that include a tag.
 I've attached a hg export that fixes this.

Thanks, added to v2.1 and v2.2 hg.




Re: [Dovecot] search and UTF-8 normalization forms (NFD)

2013-05-11 Thread Florian Zeitz
Am 10.05.2013 15:24, schrieb Florian Zeitz:
 Could you elaborate a bit why you think i;unicode-casemap does not
 handle this case?
 
 Is it only applied to the query, but not the header, or vice versa?
 It seems to me that Step 2 should map both inputs to LATIN CAPITAL
 LETTER U + COMBINING DIAERESIS.
 
 Regards,
 Florian
 

So... I had a look at this. Turns out that the current implementation of
Unicode decomposition (Step 2(b) in i;unicode-casemap) in Dovecot is
broken. It only handles decomposition properties that include a tag.
I've attached a hg export that fixes this.
# HG changeset patch
# User Florian Zeitz flo...@babelmonkeys.de
# Date 1368284892 -7200
#  Sat May 11 17:08:12 2013 +0200
# Node ID 91f175781d9b75f1617ca5ba50dd58860ef0ae13
# Parent  62874b472dc6e5c30fe7fbc64c1bf868e08bf482
liblib: Fix Unicode decomposition

diff --git a/src/lib/test-unichar.c b/src/lib/test-unichar.c
--- a/src/lib/test-unichar.c
+++ b/src/lib/test-unichar.c
@@ -2,11 +2,15 @@
 
 #include "test-lib.h"
 #include "str.h"
+#include "buffer.h"
 #include "unichar.h"
 
 void test_unichar(void)
 {
-	static const char *overlong_utf8 = "\xf8\x80\x95\x81\xa1";
+	static const char overlong_utf8[] = "\xf8\x80\x95\x81\xa1";
+	static const char collate_in[] = "\xc3\xbc \xc2\xb3";
+	static const char collate_exp[] = "U\xcc\x88 3";
+	buffer_t *collate_out;
 	unichar_t chr, chr2;
 	string_t *str = t_str_new(16);
 
@@ -18,6 +22,13 @@
 		test_assert(uni_utf8_get_char(str_c(str), &chr2) > 0);
 		test_assert(chr2 == chr);
 	}
+
+	collate_out = buffer_create_dynamic(default_pool, 32);
+	uni_utf8_to_decomposed_titlecase(collate_in, sizeof(collate_in),
+					 collate_out);
+	test_assert(!strcmp(collate_out->data, collate_exp));
+	buffer_free(&collate_out);
+
 	test_assert(!uni_utf8_str_is_valid(overlong_utf8));
 	test_assert(uni_utf8_get_char(overlong_utf8, &chr2) < 0);
 	test_end();
diff --git a/src/lib/unichar.c b/src/lib/unichar.c
--- a/src/lib/unichar.c
+++ b/src/lib/unichar.c
@@ -287,7 +287,7 @@
 
 static bool uni_ucs4_decompose_multi_utf8(unichar_t chr, buffer_t *output)
 {
-	const uint16_t *value;
+	const uint32_t *value;
 	unsigned int idx;
 
 	if (chr < multidecomp_keys[0] || chr > 0xffff)
diff --git a/src/lib/unicodemap.pl b/src/lib/unicodemap.pl
--- a/src/lib/unicodemap.pl
+++ b/src/lib/unicodemap.pl
@@ -30,14 +30,14 @@
       push @titlecase32_keys, $code;
       push @titlecase32_values, $value;
     }
-  } elsif ($decomp =~ /\<[^>]*> (.+)/) {
+  } elsif ($decomp =~ /(?:\<[^>]*> )?(.+)/) {
     # decompositions
     my $decomp_codes = $1;
     if ($decomp_codes =~ /^([0-9A-Z]*)$/i) {
       # unicharacter decomposition. use separate lists for this
       my $value = eval("0x$1");
-      if ($value > 0xffff) {
-	print STDERR "Error: We've assumed decomposition codes are max. 16bit\n";
+      if ($value > 0xffffffff) {
+	print STDERR "Error: We've assumed decomposition codes are max. 32bit\n";
 	exit 1;
       }
       if ($code <= 0xff) {
@@ -61,8 +61,8 @@
 
       foreach my $dcode (split(" ", $decomp_codes)) {
 	my $value = eval("0x$dcode");
-	if ($value > 0xffff) {
-	  print STDERR "Error: We've assumed decomposition codes are max. 16bit\n";
+	if ($value > 0xffffffff) {
+	  print STDERR "Error: We've assumed decomposition codes are max. 32bit\n";
 	  exit 1;
 	}
 	push @multidecomp_values, $value;
@@ -78,7 +78,7 @@
   my $last = $#list;
   my $n = 0;
   foreach my $key (@list) {
-    printf("0x%04x", $key);
+    printf("0x%05x", $key);
     last if ($n == $last);
     print ",";
 
@@ -137,7 +137,7 @@
 print_list(\@uni16_decomp_keys);
 print "\n};\n";
 
-print "static const uint16_t uni16_decomp_values[] = {\n\t";
+print "static const uint32_t uni16_decomp_values[] = {\n\t";
 print_list(\@uni16_decomp_values);
 print "\n};\n";
 
@@ -145,7 +145,7 @@
 print_list(\@uni32_decomp_keys);
 print "\n};\n";
 
-print "static const uint16_t uni32_decomp_values[] = {\n\t";
+print "static const uint32_t uni32_decomp_values[] = {\n\t";
 print_list(\@uni32_decomp_values);
 print "\n};\n";
 
@@ -157,6 +157,6 @@
 print_list(\@multidecomp_offsets);
 print "\n};\n";
 
-print "static const uint16_t multidecomp_values[] = {\n\t";
+print "static const uint32_t multidecomp_values[] = {\n\t";
 print_list(\@multidecomp_values);
 print "\n};\n";
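
To illustrate what the regex change in unicodemap.pl is about: the decomposition field in UnicodeData.txt is either a bare list of hex code points (canonical, e.g. "0075 0308" for U+00FC) or the same list prefixed by a tag (compatibility, e.g. "<compat> 0020 0308" for U+00A8). A parser that insists on the tag skips the canonical entries. The following is only a rough C sketch of the "tag is optional" idea, not code from Dovecot; parse_decomp() and parse_decomp_example() are names made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Parse the decomposition field of a UnicodeData.txt entry.
   Accepts both "0075 0308" and "<compat> 0020 0308"; the optional
   "<tag>" prefix is skipped. Stores up to max_codes code points in
   codes[] and returns how many were stored, or 0 if the field is
   empty, malformed or too long. (Hypothetical helper.) */
static size_t parse_decomp(const char *field, unsigned long *codes,
			   size_t max_codes)
{
	size_t n = 0;

	if (*field == '<') {
		/* tagged (compatibility) decomposition: skip "<tag>" */
		const char *end = strchr(field, '>');
		if (end == NULL)
			return 0;
		field = end + 1;
	}
	while (*field != '\0') {
		char *stop;
		unsigned long value;

		while (*field == ' ')
			field++;
		if (*field == '\0')
			break;
		value = strtoul(field, &stop, 16);
		if (stop == field || n == max_codes)
			return 0; /* not a hex number, or too many codes */
		codes[n++] = value;
		field = stop;
	}
	return n;
}

/* Small usage example: the untagged and the tagged form both parse. */
void parse_decomp_example(void)
{
	unsigned long codes[4];

	assert(parse_decomp("0075 0308", codes, 4) == 2);
	assert(codes[0] == 0x0075 && codes[1] == 0x0308);
	assert(parse_decomp("<compat> 0020 0308", codes, 4) == 2);
	assert(codes[0] == 0x0020 && codes[1] == 0x0308);
}
```

The values are kept in unsigned long rather than a 16-bit type, in the same spirit as the uint16_t to uint32_t change in the patch.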


Re: [Dovecot] search and UTF-8 normalization forms (NFD)

2013-05-10 Thread Lutz Preßler
Hello Timo,
On Thu, 02 May 2013, Timo Sirainen wrote:

 IMAP requires using i;unicode-casemap by default, as specified by RFC 5051.
 Then again, others could be supported as well, and it's not really a
 requirement that the search can't handle more flexible searches. Anyway,
 that's what Dovecot currently has implemented, and I guess it doesn't do what
 you want it to do. But there is a partial solution for this:
 
 http://dovecot.org/patches/2.1/icu-1.2.tar.gz
 
 It probably does what you want, but it only works with fts-lucene.
I'm trying to test it with the 2.2.1 installation, but have a problem
doing so: after seemingly smooth compilation and installation, I get

May 10 14:15:18 host dovecot: imap: Error: Module is for different ABI version 2.2.1 (we have 2.2.ABIv0(2.2.1)): /usr/lib/dovecot/modules/lib20_icu_plugin.so
May 10 14:15:18 host dovecot: imap: Fatal: Couldn't load required plugins

Any idea?

Greetings,
  Lutz


Re: [Dovecot] search and UTF-8 normalization forms (NFD)

2013-05-10 Thread Florian Zeitz
Am 02.05.2013 17:53, schrieb Timo Sirainen:
 On 25.4.2013, at 16.39, Lutz Preßler lutz.press...@sernet.de wrote:
 
 on a system with dovecot 2.2 I've got a mailbox containing multiple mails
 from a person called Krüger, but with the From: header encoded differently.
 Some are encoded in UTF-8 normalization form decomposed (as used by Mac OS X),
 that is, u and the umlaut accent as separate combining code points
 instead of a single ü:

  From: =?utf-8?Q?replaced_Kru=CC=88ger?= krueger@some.domain

 Searching within Roundcube webmail for krüger as sender
 misses these mails.

 Roundcube sends (dovecot rawlog):
 A0003 UID THREAD REFS UTF-8 ALL HEADER FROM {7+}krüger

 Is this supposed to work? I haven't done any more debugging
 (other search variants) or read the RFCs. As a user I would expect
 Unicode equivalence rules to be applied (see
 http://en.wikipedia.org/wiki/Unicode_equivalence).
 
 IMAP requires using i;unicode-casemap by default, as specified by RFC 5051.
 Then again, others could be supported as well, and it's not really a
 requirement that the search can't handle more flexible searches. Anyway,
 that's what Dovecot currently has implemented, and I guess it doesn't do what
 you want it to do. But there is a partial solution for this:
 
 http://dovecot.org/patches/2.1/icu-1.2.tar.gz
 
 It probably does what you want, but it only works with fts-lucene.
 
Could you elaborate a bit why you think i;unicode-casemap does not
handle this case?

Is it only applied to the query, but not the header, or vice versa?
It seems to me that Step 2 should map both inputs to LATIN CAPITAL
LETTER U + COMBINING DIAERESIS.
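
To make the expectation concrete, here is a toy C sketch of the two RFC 5051 steps (titlecase, then decompose), restricted to ASCII letters plus ü/Ü. It is not Dovecot's implementation and the names toy_casemap() and toy_casemap_example() are invented for this mail; a real implementation works from the full UnicodeData.txt tables. Under these assumptions, both spellings of "krüger" end up as the same byte string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy subset of i;unicode-casemap: titlecase, then decompose.
   Only a-z and U+00FC/U+00DC are mapped, every other byte is copied
   unchanged. Returns the output length, or (size_t)-1 if out is too
   small. (Hypothetical helper, for illustration only.) */
static size_t toy_casemap(const char *in, char *out, size_t out_size)
{
	const unsigned char *p = (const unsigned char *)in;
	size_t n = 0;

	if (out_size == 0)
		return (size_t)-1;
	while (*p != '\0') {
		/* keep room for up to 3 output bytes plus the NUL */
		if (n + 4 > out_size)
			return (size_t)-1;
		if (p[0] == 0xC3 && (p[1] == 0xBC || p[1] == 0x9C)) {
			/* ü (C3 BC) or Ü (C3 9C): titlecase gives U+00DC,
			   decomposition gives U+0055 U+0308 (55 CC 88) */
			out[n++] = 'U';
			out[n++] = (char)0xCC;
			out[n++] = (char)0x88;
			p += 2;
		} else if (*p >= 'a' && *p <= 'z') {
			out[n++] = (char)(*p - 'a' + 'A');
			p++;
		} else {
			/* includes an already decomposed U+0308 (CC 88) */
			out[n++] = (char)*p;
			p++;
		}
	}
	out[n] = '\0';
	return n;
}

/* Precomposed query and decomposed header map to the same bytes. */
void toy_casemap_example(void)
{
	char query[32], header[32];

	toy_casemap("kr\xc3\xbc" "ger", query, sizeof(query));
	toy_casemap("Kru\xcc\x88" "ger", header, sizeof(header));
	assert(strcmp(query, header) == 0);
}
```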

Regards,
Florian


Re: [Dovecot] search and UTF-8 normalization forms (NFD)

2013-05-02 Thread Timo Sirainen
On 25.4.2013, at 16.39, Lutz Preßler lutz.press...@sernet.de wrote:

 on a system with dovecot 2.2 I've got a mailbox containing multiple mails
 from a person called Krüger, but with the From: header encoded differently.
 Some are encoded in UTF-8 normalization form decomposed (as used by Mac OS X),
 that is, u and the umlaut accent as separate combining code points
 instead of a single ü:
 
  From: =?utf-8?Q?replaced_Kru=CC=88ger?= krueger@some.domain
 
 Searching within Roundcube webmail for krüger as sender
 misses these mails.
 
 Roundcube sends (dovecot rawlog):
 A0003 UID THREAD REFS UTF-8 ALL HEADER FROM {7+}krüger
 
 Is this supposed to work? I haven't done any more debugging
 (other search variants) or read the RFCs. As a user I would expect
 Unicode equivalence rules to be applied (see
 http://en.wikipedia.org/wiki/Unicode_equivalence).

IMAP requires using i;unicode-casemap by default, as specified by RFC 5051.
Then again, others could be supported as well, and it's not really a
requirement that the search can't handle more flexible searches. Anyway,
that's what Dovecot currently has implemented, and I guess it doesn't do what
you want it to do. But there is a partial solution for this:

http://dovecot.org/patches/2.1/icu-1.2.tar.gz

It probably does what you want, but it only works with fts-lucene.



[Dovecot] search and UTF-8 normalization forms (NFD)

2013-04-25 Thread Lutz Preßler
Hello,

on a system with dovecot 2.2 I've got a mailbox containing multiple mails
from a person called Krüger, but with the From: header encoded differently.
Some are encoded in UTF-8 normalization form decomposed (as used by Mac OS X),
that is, u and the umlaut accent as separate combining code points
instead of a single ü:

  From: =?utf-8?Q?replaced_Kru=CC=88ger?= krueger@some.domain
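
For reference, a small C sketch of what that encoded-word decodes to, assuming the usual RFC 2047 "Q" rules ('_' is a space, "=XX" is a hex byte). The helper names q_decode() and q_decode_example() are made up for this mail. The decoded bytes contain "u" followed by CC 88 (U+0308), so a plain byte comparison against the precomposed C3 BC form does not match.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_val(char c)
{
	if (c >= '0' && c <= '9') return c - '0';
	if (c >= 'A' && c <= 'F') return c - 'A' + 10;
	if (c >= 'a' && c <= 'f') return c - 'a' + 10;
	return -1;
}

/* Decode the payload of a "Q" encoded-word (the part between "?Q?"
   and "?=") into out. Returns the decoded length, or (size_t)-1 on
   malformed input or if out is too small. (Hypothetical helper.) */
static size_t q_decode(const char *in, char *out, size_t out_size)
{
	size_t n = 0;

	while (*in != '\0') {
		unsigned char c = (unsigned char)*in++;

		if (c == '_') {
			c = ' ';
		} else if (c == '=') {
			int hi = hex_val(in[0]);
			int lo = hi < 0 ? -1 : hex_val(in[1]);

			if (lo < 0)
				return (size_t)-1;
			c = (unsigned char)((hi << 4) | lo);
			in += 2;
		}
		if (n + 1 >= out_size)
			return (size_t)-1;
		out[n++] = (char)c;
	}
	out[n] = '\0';
	return n;
}

/* The header above is in decomposed form, so it differs bytewise
   from the precomposed spelling of the same name. */
void q_decode_example(void)
{
	char buf[64];

	q_decode("replaced_Kru=CC=88ger", buf, sizeof(buf));
	assert(strcmp(buf, "replaced Kru\xcc\x88" "ger") == 0);
	assert(strcmp(buf, "replaced Kr\xc3\xbc" "ger") != 0);
}
```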

Searching within Roundcube webmail for krüger as sender
misses these mails.

Roundcube sends (dovecot rawlog):
A0003 UID THREAD REFS UTF-8 ALL HEADER FROM {7+}krüger

Is this supposed to work? I haven't done any more debugging
(other search variants) or read the RFCs. As a user I would expect
Unicode equivalence rules to be applied (see
http://en.wikipedia.org/wiki/Unicode_equivalence).

Regards,
  Lutz