Re: [Dovecot] search and UTF-8 normalization forms (NFD)
On 21.5.2013, at 14.41, Lutz Preßler wrote:

> On Wed, 15 May 2013, Timo Sirainen wrote:
>
>> On 11.5.2013, at 18.13, Florian Zeitz wrote:
>>> So... I had a look at this. Turns out that the current implementation of
>>> Unicode decomposition (Step 2(b) in i;unicode-casemap) in Dovecot is
>>> broken. It only handles decomposition properties that include a tag.
>>> I've attached a hg export that fixes this.
>>
>> Thanks, added to v2.1 and v2.2 hg.
>>
> Thanks, but there still seems to be a problem left. Sender search
> yields all Krüger mails without fts_lucene. But with fts_lucene
> enabled - and files in lucene-indexes/ existing - it does not.
> (If I delete the lucene-index files and search for sender,
> result is correct - but only until they are recreated.)

Fixed finally: http://hg.dovecot.org/dovecot-2.2/rev/7e54af474ea4

Add the setting

    plugin {
      fts_lucene = normalize no_snowball
    }

(NOTE: this change causes all the existing lucene indexes to be rebuilt).

This fts-lucene is getting rather annoying. I wonder if all of this is
somehow magically solved in Solr.
Re: [Dovecot] search and UTF-8 normalization forms (NFD)
On Wed, 15 May 2013, Timo Sirainen wrote:
> On 11.5.2013, at 18.13, Florian Zeitz wrote:
> > So... I had a look at this. Turns out that the current implementation of
> > Unicode decomposition (Step 2(b) in i;unicode-casemap) in Dovecot is
> > broken. It only handles decomposition properties that include a tag.
> > I've attached a hg export that fixes this.
>
> Thanks, added to v2.1 and v2.2 hg.
>
Thanks, but there still seems to be a problem left. Sender search
yields all Krüger mails without fts_lucene. But with fts_lucene
enabled - and files in lucene-indexes/ existing - it does not.
(If I delete the lucene-index files and search for sender,
result is correct - but only until they are recreated.)

Lutz
Re: [Dovecot] search and UTF-8 normalization forms (NFD)
On 11.5.2013, at 18.13, Florian Zeitz wrote:

> On 10.05.2013 15:24, Florian Zeitz wrote:
>> Could you elaborate a bit why you think i;unicode-casemap does not
>> handle this case?
>>
>> Is it only applied to the query, but not the header, or vice versa?
>> It seems to me that Step 2 should map both inputs to LATIN CAPITAL
>> LETTER U + COMBINING DIAERESIS.
>>
>> Regards,
>> Florian
>>
>
> So... I had a look at this. Turns out that the current implementation of
> Unicode decomposition (Step 2(b) in i;unicode-casemap) in Dovecot is
> broken. It only handles decomposition properties that include a tag.
> I've attached a hg export that fixes this.

Thanks, added to v2.1 and v2.2 hg.
Re: [Dovecot] search and UTF-8 normalization forms (NFD)
On 10.05.2013 15:24, Florian Zeitz wrote:
> Could you elaborate a bit why you think i;unicode-casemap does not
> handle this case?
>
> Is it only applied to the query, but not the header, or vice versa?
> It seems to me that Step 2 should map both inputs to LATIN CAPITAL
> LETTER U + COMBINING DIAERESIS.
>
> Regards,
> Florian
>

So... I had a look at this. Turns out that the current implementation of
Unicode decomposition (Step 2(b) in i;unicode-casemap) in Dovecot is
broken. It only handles decomposition properties that include a tag.
I've attached a hg export that fixes this.

# HG changeset patch
# User Florian Zeitz
# Date 1368284892 -7200
#      Sat May 11 17:08:12 2013 +0200
# Node ID 91f175781d9b75f1617ca5ba50dd58860ef0ae13
# Parent  62874b472dc6e5c30fe7fbc64c1bf868e08bf482
liblib: Fix Unicode decomposition

diff --git a/src/lib/test-unichar.c b/src/lib/test-unichar.c
--- a/src/lib/test-unichar.c
+++ b/src/lib/test-unichar.c
@@ -2,11 +2,15 @@
 
 #include "test-lib.h"
 #include "str.h"
+#include "buffer.h"
 #include "unichar.h"
 
 void test_unichar(void)
 {
-	static const char *overlong_utf8 = "\xf8\x80\x95\x81\xa1";
+	static const char overlong_utf8[] = "\xf8\x80\x95\x81\xa1";
+	static const char collate_in[] = "\xc3\xbc \xc2\xb3";
+	static const char collate_exp[] = "U\xcc\x88 3";
+	buffer_t *collate_out;
 	unichar_t chr, chr2;
 	string_t *str = t_str_new(16);
 
@@ -18,6 +22,13 @@
 		test_assert(uni_utf8_get_char(str_c(str), &chr2) > 0);
 		test_assert(chr2 == chr);
 	}
+
+	collate_out = buffer_create_dynamic(default_pool, 32);
+	uni_utf8_to_decomposed_titlecase(collate_in, sizeof(collate_in),
+					 collate_out);
+	test_assert(!strcmp(collate_out->data, collate_exp));
+	buffer_free(&collate_out);
+
 	test_assert(!uni_utf8_str_is_valid(overlong_utf8));
 	test_assert(uni_utf8_get_char(overlong_utf8, &chr2) < 0);
 	test_end();
diff --git a/src/lib/unichar.c b/src/lib/unichar.c
--- a/src/lib/unichar.c
+++ b/src/lib/unichar.c
@@ -287,7 +287,7 @@
 static bool uni_ucs4_decompose_multi_utf8(unichar_t chr,
 					  buffer_t *output)
 {
-	const uint16_t *value;
+	const uint32_t *value;
 	unsigned int idx;
 
 	if (chr < multidecomp_keys[0] || chr > 0x)
diff --git a/src/lib/unicodemap.pl b/src/lib/unicodemap.pl
--- a/src/lib/unicodemap.pl
+++ b/src/lib/unicodemap.pl
@@ -30,14 +30,14 @@
       push @titlecase32_keys, $code;
       push @titlecase32_values, $value;
     }
-  } elsif ($decomp =~ /\<[^>]*> (.+)/) {
+  } elsif ($decomp =~ /(?:\<[^>]*> )?(.+)/) {
     # decompositions
     my $decomp_codes = $1;
     if ($decomp_codes =~ /^([0-9A-Z]*)$/i) {
       # unicharacter decomposition. use separate lists for this
       my $value = eval("0x$1");
-      if ($value > 0x) {
-        print STDERR "Error: We've assumed decomposition codes are max. 16bit\n";
+      if ($value > 0x) {
+        print STDERR "Error: We've assumed decomposition codes are max. 32bit\n";
         exit 1;
       }
       if ($code <= 0xff) {
@@ -61,8 +61,8 @@
 
       foreach my $dcode (split(" ", $decomp_codes)) {
         my $value = eval("0x$dcode");
-        if ($value > 0x) {
-          print STDERR "Error: We've assumed decomposition codes are max. 16bit\n";
+        if ($value > 0x) {
+          print STDERR "Error: We've assumed decomposition codes are max. 32bit\n";
           exit 1;
         }
         push @multidecomp_values, $value;
@@ -78,7 +78,7 @@
   my $last = $#list;
   my $n = 0;
   foreach my $key (@list) {
-    printf("0x%04x", $key);
+    printf("0x%05x", $key);
     last if ($n == $last);
     print ",";
 
@@ -137,7 +137,7 @@
 print_list(\@uni16_decomp_keys);
 print "\n};\n";
 
-print "static const uint16_t uni16_decomp_values[] = {\n\t";
+print "static const uint32_t uni16_decomp_values[] = {\n\t";
 print_list(\@uni16_decomp_values);
 print "\n};\n";
 
@@ -145,7 +145,7 @@
 print_list(\@uni32_decomp_keys);
 print "\n};\n";
 
-print "static const uint16_t uni32_decomp_values[] = {\n\t";
+print "static const uint32_t uni32_decomp_values[] = {\n\t";
 print_list(\@uni32_decomp_values);
 print "\n};\n";
 
@@ -157,6 +157,6 @@
 print_list(\@multidecomp_offsets);
 print "\n};\n";
 
-print "static const uint16_t multidecomp_values[] = {\n\t";
+print "static const uint32_t multidecomp_values[] = {\n\t";
 print_list(\@multidecomp_values);
 print "\n};\n";
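The central change in the patch is the regular expression in unicodemap.pl. The Python sketch below mirrors the old and the new pattern and runs them against the decomposition field that the standard `unicodedata` module exposes, which uses the same "optional <tag>, then hex code points" layout as UnicodeData.txt. It is only an illustration: `decomposition_codes` is a hypothetical helper and none of this is Dovecot code.

```python
import re
import unicodedata

# Patterns mirroring the two Perl expressions in the diff.
OLD_PATTERN = re.compile(r"<[^>]*> (.+)")       # a <tag> is mandatory
NEW_PATTERN = re.compile(r"(?:<[^>]*> )?(.+)")  # the <tag> is optional


def decomposition_codes(ch, pattern):
    """Hypothetical helper: return the decomposition code points, or None."""
    field = unicodedata.decomposition(ch)  # e.g. "0075 0308" or "<super> 0033"
    match = pattern.search(field)
    if match is None:
        return None
    return [int(code, 16) for code in match.group(1).split()]


# U+00FC (ü) has a canonical, untagged decomposition;
# U+00B3 (³) has a tagged compatibility decomposition.
for ch in ("\u00fc", "\u00b3"):
    print(repr(unicodedata.decomposition(ch)),
          decomposition_codes(ch, OLD_PATTERN),
          decomposition_codes(ch, NEW_PATTERN))
```

With the old pattern the untagged entry for ü is skipped entirely, which fits the report that only decomposition properties with a tag were handled.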
Re: [Dovecot] search and UTF-8 normalization forms (NFD)
On 02.05.2013 17:53, Timo Sirainen wrote:
> On 25.4.2013, at 16.39, Lutz Preßler wrote:
>
>> on a system with dovecot 2.2 I've got a mailbox containing multiple mails
>> from a person called Krüger, but From: header encoded differently.
>> Some are encoded in UTF-8 normalization form decomposed (as used by Mac OSX),
>> that is u and umlaut accent as separate combined codepoints
>> instead of one ü:
>>
>> From: =?utf-8?Q?replaced_Kru=CC=88ger?=
>>
>> Searching within roundcube webmail for "krüger" as sender
>> misses these mails.
>>
>> Roundcube sends (dovecot rawlog):
>> A0003 UID THREAD REFS UTF-8 ALL HEADER FROM {7+}krüger
>>
>> Is this supposed to work? Haven't done any more debugging
>> (other search variants) or read RFCs. As a user I would expect
>> Unicode equivalence rules be applied (see
>> http://en.wikipedia.org/wiki/Unicode_equivalence)
>
> IMAP requires using i;unicode-casemap by default, as specified by RFC 5051.
> Then again, others could be supported as well, and it's not really a
> requirement that the search can't handle more flexible searches.. Anyway,
> that's what Dovecot currently has implemented, and I guess it doesn't do what
> you want it to do. But there is a partial solution for this:
>
> http://dovecot.org/patches/2.1/icu-1.2.tar.gz
>
> It probably does what you want, but it only works with fts-lucene.
>

Could you elaborate a bit why you think i;unicode-casemap does not
handle this case?

Is it only applied to the query, but not the header, or vice versa?
It seems to me that Step 2 should map both inputs to LATIN CAPITAL
LETTER U + COMBINING DIAERESIS.

Regards,
Florian
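The expectation about Step 2 can be checked with a rough Python approximation of i;unicode-casemap: titlecase each code point, then decompose fully. This sketch assumes that NFKD from the standard `unicodedata` module is close enough to the recursive decomposition of step 2(b) (as I understand RFC 5051, that step is effectively Normalization Form KD, although NFKD additionally reorders combining marks). The name `casemap_approx` is invented for this example and it is not how Dovecot implements the collation.

```python
import unicodedata


def casemap_approx(text):
    """Approximate i;unicode-casemap: titlecase per code point, then NFKD."""
    titlecased = "".join(ch.title() for ch in text)
    return unicodedata.normalize("NFKD", titlecased)


precomposed = "kr\u00fcger"  # ü as one code point (U+00FC)
decomposed = "kru\u0308ger"  # u followed by U+0308 COMBINING DIAERESIS

# Both spellings should collapse to the same sequence.
print(casemap_approx(precomposed) == casemap_approx(decomposed))

# Names of the code points that ü is mapped to.
print([unicodedata.name(ch) for ch in casemap_approx("\u00fc")])
```

If the mapping is applied to both the search key and the header value, the two forms of "Krüger" compare equal, so a miss points at the decomposition step itself rather than at the collation design.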
Re: [Dovecot] search and UTF-8 normalization forms (NFD)
Hello Timo,

On Thu, 02 May 2013, Timo Sirainen wrote:
> IMAP requires using i;unicode-casemap by default, as specified by RFC 5051.
> Then again, others could be supported as well, and it's not really a
> requirement that the search can't handle more flexible searches.. Anyway,
> that's what Dovecot currently has implemented, and I guess it doesn't do what
> you want it to do. But there is a partial solution for this:
>
> http://dovecot.org/patches/2.1/icu-1.2.tar.gz
>
> It probably does what you want, but it only works with fts-lucene.

I'm trying to test it with the 2.2.1 installation, but have a problem
doing so: after seemingly smooth compilation and installation, I get

May 10 14:15:18 host dovecot: imap: Error: Module is for different ABI version 2.2.1 (we have 2.2.ABIv0(2.2.1)): /usr/lib/dovecot/modules/lib20_icu_plugin.so
May 10 14:15:18 host dovecot: imap: Fatal: Couldn't load required plugins

Any idea?

Greetings,
Lutz
Re: [Dovecot] search and UTF-8 normalization forms (NFD)
On 25.4.2013, at 16.39, Lutz Preßler wrote:

> on a system with dovecot 2.2 I've got a mailbox containing multiple mails
> from a person called Krüger, but From: header encoded differently.
> Some are encoded in UTF-8 normalization form decomposed (as used by Mac OSX),
> that is u and umlaut accent as separate combined codepoints
> instead of one ü:
>
> From: =?utf-8?Q?replaced_Kru=CC=88ger?=
>
> Searching within roundcube webmail for "krüger" as sender
> misses these mails.
>
> Roundcube sends (dovecot rawlog):
> A0003 UID THREAD REFS UTF-8 ALL HEADER FROM {7+}krüger
>
> Is this supposed to work? Haven't done any more debugging
> (other search variants) or read RFCs. As a user I would expect
> Unicode equivalence rules be applied (see
> http://en.wikipedia.org/wiki/Unicode_equivalence)

IMAP requires using i;unicode-casemap by default, as specified by RFC 5051.
Then again, others could be supported as well, and it's not really a
requirement that the search can't handle more flexible searches. Anyway,
that's what Dovecot currently has implemented, and I guess it doesn't do what
you want it to do. But there is a partial solution for this:

http://dovecot.org/patches/2.1/icu-1.2.tar.gz

It probably does what you want, but it only works with fts-lucene.
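The mismatch from the report can be reproduced outside Dovecot. The following is a minimal Python sketch, not Dovecot code: it decodes the From: header quoted above with the standard library, then compares a plain case-insensitive substring test with one that first brings both strings into NFD. The helper name `nfd_fold` is made up for this illustration, and a real IMAP SEARCH uses i;unicode-casemap rather than NFD plus case folding.

```python
import unicodedata
from email.header import decode_header

# Header value quoted in the report: "u" followed by U+0308 COMBINING DIAERESIS.
raw_from = "=?utf-8?Q?replaced_Kru=CC=88ger?="
payload, charset = decode_header(raw_from)[0]
sender = payload.decode(charset)

# Search key as sent by Roundcube: precomposed U+00FC, which makes the key
# 7 octets long in UTF-8, consistent with the {7+} literal in the rawlog.
query = "kr\u00fcger"
query_octets = len(query.encode("utf-8"))


def nfd_fold(text):
    """Hypothetical helper: canonical decomposition, then case folding."""
    return unicodedata.normalize("NFD", text).casefold()


# Comparing code points directly misses the decomposed sender name.
naive_hit = query in sender.casefold()

# Normalizing both sides first makes the two spellings of "ü" equal.
normalized_hit = nfd_fold(query) in nfd_fold(sender)

print(query_octets, naive_hit, normalized_hit)
```

The point of the sketch is only that both the key and the header have to pass through the same normalization before they are compared; which normalization the server applies is a separate question.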