Re: [Dovecot] search and UTF-8 normalization forms (NFD)

2013-06-08 Thread Timo Sirainen
On 21.5.2013, at 14.41, Lutz Preßler  wrote:

> On Wed, 15 May 2013, Timo Sirainen wrote:
> 
>> On 11.5.2013, at 18.13, Florian Zeitz  wrote:
>>> So... I had a look at this. Turns out that the current implementation of
>>> Unicode decomposition (Step 2(b) in i;unicode-casemap) in Dovecot is
>>> broken. It only handles decomposition properties that include a tag.
>>> I've attached a hg export that fixes this.
>> 
>> Thanks, added to v2.1 and v2.2 hg.
>> 
> Thanks, but one problem still seems to remain. Without fts_lucene, a
> sender search returns all Krüger mails. With fts_lucene enabled, and
> index files present in lucene-indexes/, it does not.
> (If I delete the lucene-index files and then search by sender, the
> result is correct, but only until the files are recreated.)

Fixed finally: http://hg.dovecot.org/dovecot-2.2/rev/7e54af474ea4

Add plugin { fts_lucene = normalize no_snowball } setting (NOTE: this change 
causes all the existing lucene indexes to be rebuilt).

This fts-lucene is getting rather annoying. I wonder if all of this is somehow 
magically solved in Solr.
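
One way a full-text index can show this symptom, even though the unindexed search path is correct, is that tokens are folded differently when they are indexed and when they are looked up. The following is a toy inverted index in Python, not fts-lucene's actual implementation: `fold`, `build_index`, `search` and the `normalize` flag are hypothetical names, and using NFKD plus `casefold()` is an assumption made only for illustration.

```python
import unicodedata

def fold(token, normalize):
    """Case-fold a token; optionally decompose it first (NFKD)."""
    if normalize:
        token = unicodedata.normalize("NFKD", token)
    return token.casefold()

def build_index(messages, normalize):
    """Map each folded sender token to the set of message ids containing it."""
    index = {}
    for msg_id, sender in messages.items():
        for token in sender.split():
            index.setdefault(fold(token, normalize), set()).add(msg_id)
    return index

def search(index, key, normalize):
    """Look up a single search key, folded the same way as the index."""
    return index.get(fold(key, normalize), set())

# Hypothetical mailbox: message 1 uses precomposed ü, message 2 uses u + U+0308.
messages = {1: "Kr\u00fcger", 2: "Kru\u0308ger"}

plain = build_index(messages, normalize=False)
normalized = build_index(messages, normalize=True)

print(sorted(search(plain, "kr\u00fcger", normalize=False)))
print(sorted(search(normalized, "kr\u00fcger", normalize=True)))
```

In this model the unnormalized index finds only the precomposed sender, while the normalized one finds both. It also shows why a change in normalization would require existing indexes to be rebuilt: tokens stored under the old folding no longer match keys folded the new way.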



Re: [Dovecot] search and UTF-8 normalization forms (NFD)

2013-05-21 Thread Lutz Preßler
On Wed, 15 May 2013, Timo Sirainen wrote:

> On 11.5.2013, at 18.13, Florian Zeitz  wrote:
> > So... I had a look at this. Turns out that the current implementation of
> > Unicode decomposition (Step 2(b) in i;unicode-casemap) in Dovecot is
> > broken. It only handles decomposition properties that include a tag.
> > I've attached a hg export that fixes this.
> 
> Thanks, added to v2.1 and v2.2 hg.
> 
Thanks, but one problem still seems to remain. Without fts_lucene, a
sender search returns all Krüger mails. With fts_lucene enabled, and
index files present in lucene-indexes/, it does not.
(If I delete the lucene-index files and then search by sender, the
result is correct, but only until the files are recreated.)

Lutz 


Re: [Dovecot] search and UTF-8 normalization forms (NFD)

2013-05-15 Thread Timo Sirainen
On 11.5.2013, at 18.13, Florian Zeitz  wrote:

> On 10.05.2013 at 15:24, Florian Zeitz wrote:
>> Could you elaborate a bit why you think i;unicode-casemap does not
>> handle this case?
>> 
>> Is it only applied to the query, but not the header, or vice versa?
>> It seems to me that Step 2 should map both inputs to LATIN CAPITAL
>> LETTER U + COMBINING DIAERESIS.
>> 
>> Regards,
>> Florian
>> 
> 
> So... I had a look at this. Turns out that the current implementation of
> Unicode decomposition (Step 2(b) in i;unicode-casemap) in Dovecot is
> broken. It only handles decomposition properties that include a tag.
> I've attached a hg export that fixes this.

Thanks, added to v2.1 and v2.2 hg.




Re: [Dovecot] search and UTF-8 normalization forms (NFD)

2013-05-11 Thread Florian Zeitz
On 10.05.2013 at 15:24, Florian Zeitz wrote:
> Could you elaborate a bit why you think i;unicode-casemap does not
> handle this case?
> 
> Is it only applied to the query, but not the header, or vice versa?
> It seems to me that Step 2 should map both inputs to LATIN CAPITAL
> LETTER U + COMBINING DIAERESIS.
> 
> Regards,
> Florian
> 

So... I had a look at this. Turns out that the current implementation of
Unicode decomposition (Step 2(b) in i;unicode-casemap) in Dovecot is
broken. It only handles decomposition properties that include a tag.
I've attached a hg export that fixes this.
# HG changeset patch
# User Florian Zeitz 
# Date 1368284892 -7200
#  Sat May 11 17:08:12 2013 +0200
# Node ID 91f175781d9b75f1617ca5ba50dd58860ef0ae13
# Parent  62874b472dc6e5c30fe7fbc64c1bf868e08bf482
liblib: Fix Unicode decomposition

diff --git a/src/lib/test-unichar.c b/src/lib/test-unichar.c
--- a/src/lib/test-unichar.c
+++ b/src/lib/test-unichar.c
@@ -2,11 +2,15 @@
 
 #include "test-lib.h"
 #include "str.h"
+#include "buffer.h"
 #include "unichar.h"
 
 void test_unichar(void)
 {
-   static const char *overlong_utf8 = "\xf8\x80\x95\x81\xa1";
+   static const char overlong_utf8[] = "\xf8\x80\x95\x81\xa1";
+   static const char collate_in[] = "\xc3\xbc \xc2\xb3";
+   static const char collate_exp[] = "U\xcc\x88 3";
+   buffer_t *collate_out;
    unichar_t chr, chr2;
    string_t *str = t_str_new(16);
 
@@ -18,6 +22,13 @@
        test_assert(uni_utf8_get_char(str_c(str), &chr2) > 0);
        test_assert(chr2 == chr);
    }
+
+   collate_out = buffer_create_dynamic(default_pool, 32);
+   uni_utf8_to_decomposed_titlecase(collate_in, sizeof(collate_in),
+collate_out);
+   test_assert(!strcmp(collate_out->data, collate_exp));
+   buffer_free(&collate_out);
+
    test_assert(!uni_utf8_str_is_valid(overlong_utf8));
    test_assert(uni_utf8_get_char(overlong_utf8, &chr2) < 0);
    test_end();
diff --git a/src/lib/unichar.c b/src/lib/unichar.c
--- a/src/lib/unichar.c
+++ b/src/lib/unichar.c
@@ -287,7 +287,7 @@
 
 static bool uni_ucs4_decompose_multi_utf8(unichar_t chr, buffer_t *output)
 {
-   const uint16_t *value;
+   const uint32_t *value;
unsigned int idx;
 
if (chr < multidecomp_keys[0] || chr > 0x)
diff --git a/src/lib/unicodemap.pl b/src/lib/unicodemap.pl
--- a/src/lib/unicodemap.pl
+++ b/src/lib/unicodemap.pl
@@ -30,14 +30,14 @@
   push @titlecase32_keys, $code;
   push @titlecase32_values, $value;
 }
-  } elsif ($decomp =~ /\<[^>]*> (.+)/) {
+  } elsif ($decomp =~ /(?:\<[^>]*> )?(.+)/) {
 # decompositions
 my $decomp_codes = $1;
 if ($decomp_codes =~ /^([0-9A-Z]*)$/i) {
   # unicharacter decomposition. use separate lists for this
   my $value = eval("0x$1");
-  if ($value > 0x) {
-   print STDERR "Error: We've assumed decomposition codes are max. 16bit\n";
+  if ($value > 0x) {
+   print STDERR "Error: We've assumed decomposition codes are max. 32bit\n";
exit 1;
   }
   if ($code <= 0xff) {
@@ -61,8 +61,8 @@
 
   foreach my $dcode (split(" ", $decomp_codes)) {
my $value = eval("0x$dcode");
-   if ($value > 0x) {
- print STDERR "Error: We've assumed decomposition codes are max. 16bit\n";
+   if ($value > 0x) {
+ print STDERR "Error: We've assumed decomposition codes are max. 32bit\n";
  exit 1;
}
push @multidecomp_values, $value;
@@ -78,7 +78,7 @@
   my $last = $#list;
   my $n = 0;
   foreach my $key (@list) {
-printf("0x%04x", $key);
+printf("0x%05x", $key);
 last if ($n == $last);
 print ",";
 
@@ -137,7 +137,7 @@
 print_list(\@uni16_decomp_keys);
 print "\n};\n";
 
-print "static const uint16_t uni16_decomp_values[] = {\n\t";
+print "static const uint32_t uni16_decomp_values[] = {\n\t";
 print_list(\@uni16_decomp_values);
 print "\n};\n";
 
@@ -145,7 +145,7 @@
 print_list(\@uni32_decomp_keys);
 print "\n};\n";
 
-print "static const uint16_t uni32_decomp_values[] = {\n\t";
+print "static const uint32_t uni32_decomp_values[] = {\n\t";
 print_list(\@uni32_decomp_values);
 print "\n};\n";
 
@@ -157,6 +157,6 @@
 print_list(\@multidecomp_offsets);
 print "\n};\n";
 
-print "static const uint16_t multidecomp_values[] = {\n\t";
+print "static const uint32_t multidecomp_values[] = {\n\t";
 print_list(\@multidecomp_values);
 print "\n};\n";
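
The effect of the unicodemap.pl regex change can be reproduced outside Perl. The following is a minimal Python sketch, not part of the patch: the two patterns are ported from the diff above, while the helper name `decomp_codes` and the sample characters are illustrative assumptions. In the Unicode character database only compatibility mappings carry a `<tag>` prefix in the decomposition field. Canonical mappings such as the one for U+00FC have none, so the old pattern skipped them.

```python
import re
import unicodedata

# Patterns ported from the diff; Perl's =~ is an unanchored search.
OLD = re.compile(r"<[^>]*> (.+)")
NEW = re.compile(r"(?:<[^>]*> )?(.+)")

def decomp_codes(pattern, field):
    """Return the captured code point list, or None if the field is skipped."""
    m = pattern.search(field)
    return m.group(1) if m else None

# Decomposition fields as reported by Python's bundled Unicode database.
tagged = unicodedata.decomposition("\u00b3")      # SUPERSCRIPT THREE
untagged = unicodedata.decomposition("\u00fc")    # LATIN SMALL LETTER U WITH DIAERESIS
astral = unicodedata.decomposition("\U0001d15e")  # MUSICAL SYMBOL HALF NOTE

for field in (tagged, untagged, astral):
    print(repr(field), decomp_codes(OLD, field), decomp_codes(NEW, field))

# Canonical mappings can point beyond U+FFFF, so 16-bit values no longer fit.
print(any(int(code, 16) > 0xFFFF for code in astral.split()))
```

The last check is presumably the reason the patch widens the generated value tables from uint16_t to uint32_t: once untagged mappings are accepted, some decomposition values lie outside the 16-bit range.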


Re: [Dovecot] search and UTF-8 normalization forms (NFD)

2013-05-10 Thread Florian Zeitz
On 02.05.2013 at 17:53, Timo Sirainen wrote:
> On 25.4.2013, at 16.39, Lutz Preßler  wrote:
> 
>> on a system with dovecot 2.2 I've got a mailbox containing multiple mails
>> from a person called Krüger, but with the From: header encoded differently.
>> Some are encoded in the decomposed UTF-8 normalization form (NFD, as used by
>> Mac OS X), that is, u and the combining diaeresis as two separate code points
>> instead of a single ü:
>>
>>  From: =?utf-8?Q?replaced_Kru=CC=88ger?= 
>>
>> Searching within roundcube webmail for "krüger" as sender
>> misses these mails.
>>
>> Roundcube sends (dovecot rawlog):
>> A0003 UID THREAD REFS UTF-8 ALL HEADER FROM {7+}krüger
>>
>> Is this supposed to work? I haven't done any more debugging
>> (other search variants) or read the RFCs. As a user I would expect
>> Unicode equivalence rules to be applied (see
>> http://en.wikipedia.org/wiki/Unicode_equivalence).
> 
> IMAP requires using i;unicode-casemap by default, as specified by RFC 5051.
> Then again, other comparators could be supported as well, and nothing
> requires the search to be that strict; it may match more flexibly. Anyway,
> that's what Dovecot currently has implemented, and I guess it doesn't do what
> you want it to do. But there is a partial solution for this:
> 
> http://dovecot.org/patches/2.1/icu-1.2.tar.gz
> 
> It probably does what you want, but it only works with fts-lucene.
> 
Could you elaborate a bit why you think i;unicode-casemap does not
handle this case?

Is it only applied to the query, but not the header, or vice versa?
It seems to me that Step 2 should map both inputs to LATIN CAPITAL
LETTER U + COMBINING DIAERESIS.

Regards,
Florian
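
The expectation that both inputs end up as LATIN CAPITAL LETTER U + COMBINING DIAERESIS can be checked with a rough Python approximation of the comparator. This is a sketch under stated assumptions, not Dovecot's implementation: `casemap_sketch` is a hypothetical name, `str.upper()` stands in for the titlecase mapping of step 1, and NFKD stands in for the full decomposition of step 2.

```python
import unicodedata

def casemap_sketch(text):
    """Approximate i;unicode-casemap: map case first, then decompose fully."""
    # Assumption: str.upper() replaces the per-character titlecase mapping;
    # the two differ for a few characters (for example U+01C6).
    mapped = text.upper()
    # Assumption: NFKD replaces the recursive decomposition step; NFKD also
    # reorders combining marks canonically.
    return unicodedata.normalize("NFKD", mapped)

query = "kr\u00fcger"     # precomposed ü, as in the search key
header = "Kru\u0308ger"   # u + COMBINING DIAERESIS, as in the From: header

print(casemap_sketch(query) == casemap_sketch(header))
print([unicodedata.name(c) for c in casemap_sketch(query)[2:4]])
```

Under these assumptions both strings compare equal after mapping, so a mismatch would point at the decomposition step being skipped for one of the inputs.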


Re: [Dovecot] search and UTF-8 normalization forms (NFD)

2013-05-10 Thread Lutz Preßler
Hello Timo,
On Thu, 02 May 2013, Timo Sirainen wrote:

> IMAP requires using i;unicode-casemap by default, as specified by RFC 5051.
> Then again, other comparators could be supported as well, and nothing
> requires the search to be that strict; it may match more flexibly. Anyway,
> that's what Dovecot currently has implemented, and I guess it doesn't do what
> you want it to do. But there is a partial solution for this:
> 
> http://dovecot.org/patches/2.1/icu-1.2.tar.gz
> 
> It probably does what you want, but it only works with fts-lucene.
I'm trying to test it with the 2.2.1 installation, but have a problem
doing so: after seemingly smooth compilation and installation, I get

May 10 14:15:18 host dovecot: imap: Error: Module is for different ABI version 2.2.1 (we have 2.2.ABIv0(2.2.1)): /usr/lib/dovecot/modules/lib20_icu_plugin.so
May 10 14:15:18 host dovecot: imap: Fatal: Couldn't load required plugins

Any idea?

Greetings,
  Lutz


Re: [Dovecot] search and UTF-8 normalization forms (NFD)

2013-05-02 Thread Timo Sirainen
On 25.4.2013, at 16.39, Lutz Preßler  wrote:

> on a system with dovecot 2.2 I've got a mailbox containing multiple mails
> from a person called Krüger, but with the From: header encoded differently.
> Some are encoded in the decomposed UTF-8 normalization form (NFD, as used by
> Mac OS X), that is, u and the combining diaeresis as two separate code points
> instead of a single ü:
> 
>  From: =?utf-8?Q?replaced_Kru=CC=88ger?= 
> 
> Searching within roundcube webmail for "krüger" as sender
> misses these mails.
> 
> Roundcube sends (dovecot rawlog):
> A0003 UID THREAD REFS UTF-8 ALL HEADER FROM {7+}krüger
> 
> Is this supposed to work? I haven't done any more debugging
> (other search variants) or read the RFCs. As a user I would expect
> Unicode equivalence rules to be applied (see
> http://en.wikipedia.org/wiki/Unicode_equivalence).

IMAP requires using i;unicode-casemap by default, as specified by RFC 5051.
Then again, other comparators could be supported as well, and nothing
requires the search to be that strict; it may match more flexibly. Anyway,
that's what Dovecot currently has implemented, and I guess it doesn't do what
you want it to do. But there is a partial solution for this:

http://dovecot.org/patches/2.1/icu-1.2.tar.gz

It probably does what you want, but it only works with fts-lucene.



[Dovecot] search and UTF-8 normalization forms (NFD)

2013-04-25 Thread Lutz Preßler
Hello,

on a system with dovecot 2.2 I've got a mailbox containing multiple mails
from a person called Krüger, but with the From: header encoded differently.
Some are encoded in the decomposed UTF-8 normalization form (NFD, as used by
Mac OS X), that is, u and the combining diaeresis as two separate code points
instead of a single ü:

  From: =?utf-8?Q?replaced_Kru=CC=88ger?= 

Searching within roundcube webmail for "krüger" as sender
misses these mails.

Roundcube sends (dovecot rawlog):
A0003 UID THREAD REFS UTF-8 ALL HEADER FROM {7+}krüger

Is this supposed to work? I haven't done any more debugging
(other search variants) or read the RFCs. As a user I would expect
Unicode equivalence rules to be applied (see
http://en.wikipedia.org/wiki/Unicode_equivalence).

Regards,
  Lutz
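
The mismatch described in this report can be reproduced with the header and search key quoted above. The following is a minimal Python sketch using only the standard library; the helper `nfc` and the plain substring comparison are illustrative assumptions and say nothing about how Dovecot matches headers internally. The {7+} literal length suggests that the search key was sent with a precomposed ü, since the decomposed spelling would take 8 bytes.

```python
import unicodedata
from email.header import decode_header

raw = "=?utf-8?Q?replaced_Kru=CC=88ger?="
payload, charset = decode_header(raw)[0]   # (bytes, charset) of the encoded word
sender = payload.decode(charset)

key = "kr\u00fcger"  # precomposed ü

def nfc(s):
    """Compose base letters and combining marks where possible."""
    return unicodedata.normalize("NFC", s)

print(len(key.encode("utf-8")))          # byte length of the search key
print(key in sender.lower())             # comparison on raw code points
print(nfc(key) in nfc(sender).lower())   # comparison after NFC normalization
```

Comparing raw code points fails because the header holds u followed by U+0308, while the key holds the single code point U+00FC. Once both sides are brought to the same normalization form, the match succeeds.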