http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=14759
--- Comment #10 from David Cook <dc...@prosentient.com.au> ---

So I think I've tracked down the C code behind Text::Unaccent:
https://github.com/gitpan/Text-Unaccent/blob/master/unac.c

The only reference I see to "damma" is in the U+FE70...U+FEFF code point range, which appears to list isolated forms. That is not what we're dealing with in these examples.

While I haven't reviewed the code extensively, it looks like the tables used for Text::Unaccent are lacking...

If you replace the following line in Galen's script:

use Text::Unaccent qw//;

with

use Text::Unaccent qw/unac_debug/;
unac_debug($Text::Unaccent::DEBUG_HIGH);

you'll get more details of how Text::Unaccent is working (or not working, as it were).

Here's the output I get for the Arabic:

unac.c:13708: unac_data0[5] & unac_positions[0][6]: 0x0645 => untouched
unac.c:13708: unac_data0[15] & unac_positions[0][16]: 0x064f => untouched
unac.c:13708: unac_data34[15] & unac_positions[34][16]: 0x062f => untouched
unac.c:13708: unac_data0[14] & unac_positions[0][15]: 0x064e => untouched
unac.c:13708: unac_data34[17] & unac_positions[34][18]: 0x0631 => untouched
unac.c:13708: unac_data0[16] & unac_positions[0][17]: 0x0650 => untouched
unac.c:13708: unac_data0[17] & unac_positions[0][18]: 0x0651 => untouched
unac.c:13708: unac_data34[19] & unac_positions[34][20]: 0x0633 => untouched
unac.c:13708: unac_data0[14] & unac_positions[0][15]: 0x064e => untouched
unac.c:13708: unac_data34[9] & unac_positions[34][10]: 0x0629 => untouched
Text::Unaccent - مُدَرِّسَة => Ù�Ù�دÙ�رÙ�Ù�سÙ�Ø©
Strip NonspacingMark - مُدَرِّسَة => مدرسة

Here's the output I get for the Greek:

unac.c:13708: unac_data21[6] & unac_positions[21][7]: 0x0386 => 0x0391
unac.c:13708: unac_data22[12] & unac_positions[22][13]: 0x03ac => 0x03b1
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[8] & unac_positions[21][9]: 0x0388 => 0x0395
unac.c:13708: unac_data22[13] & unac_positions[22][14]: 0x03ad => 0x03b5
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[9] & unac_positions[21][10]: 0x0389 => 0x0397
unac.c:13708: unac_data22[14] & unac_positions[22][15]: 0x03ae => 0x03b7
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[10] & unac_positions[21][11]: 0x038a => 0x0399
unac.c:13708: unac_data22[15] & unac_positions[22][16]: 0x03af => 0x03b9
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[12] & unac_positions[21][13]: 0x038c => 0x039f
unac.c:13708: unac_data23[12] & unac_positions[23][13]: 0x03cc => 0x03bf
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[14] & unac_positions[21][15]: 0x038e => 0x03a5
unac.c:13708: unac_data23[13] & unac_positions[23][14]: 0x03cd => 0x03c5
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[15] & unac_positions[21][16]: 0x038f => 0x03a9
unac.c:13708: unac_data23[14] & unac_positions[23][15]: 0x03ce => 0x03c9
Text::Unaccent - Άά Έέ Ήή Ίί Όό Ύύ Ώώ => Î�α Î�ε Î�η Î�ι Î�ο Î¥Ï� ΩÏ�
Strip NonspacingMark - Άά Έέ Ήή Ίί Όό Ύύ Ώώ => Αα Εε Ηη Ιι Οο Υυ Ωω

Interestingly, we can see a reference to the Greek capital letter alpha with the tonos diacritic:

* 0386 GREEK CAPITAL LETTER ALPHA WITH TONOS
* 0391 GREEK CAPITAL LETTER ALPHA

Indeed, in the output, we can see that 0x0386 was changed to 0x0391... although admittedly I don't know exactly how. It looks like a binary operation that uses a bitmask to produce a certain value... we don't need to know 100% how that mechanism is working right now... just that it works as described above.

--

So in the Arabic example... everything was "untouched" and yet the output is garbled. That's certainly an encoding issue...
Indeed, look at the following:

dcook@koha:~/experiments> echo "مُدَرِّسَة" | iconv -f latin1 -t utf-8
Ù�Ù�دÙ�رÙ�Ù�سÙ�Ø©

That is the same output as Text::Unaccent:

Text::Unaccent - مُدَرِّسَة => Ù�Ù�دÙ�رÙ�Ù�سÙ�Ø©

So somewhere along the line that UTF-8 string is getting double-encoded. Check this out:

dcook@koha:~/experiments> echo "مُدَرِّسَة" | iconv -f latin1 -t utf-8 | iconv -f utf-8 -t latin1
مُدَرِّسَة

I think the double-encoding is down to us using "binmode STDOUT, ':utf8';" (which tells Perl to output UTF-8 encoded bytes instead of the Latin-1, or other single-byte encoding, it normally uses) together with "use utf8" (which tells Perl that the source code uses UTF-8)...

Removing those gets us the following:

Text::Unaccent - été => ete
Strip NonspacingMark - été => A▒tA▒
Text::Unaccent - umlaüt => umlaut
Wide character in print at unaccent.pl line 47.
Strip NonspacingMark - umlaüt => umlaA1⁄4t
Text::Unaccent - עברית => עברית
Strip NonspacingMark - עברית => עב▒ י▒a
Text::Unaccent - חוֹלָם => חוֹלָם
Strip NonspacingMark - חוֹלָם => חוO1לO ם
Text::Unaccent - 北京市 => 北京市
Strip NonspacingMark - 北京市 => a▒▒ao▒a ▒
Text::Unaccent - Άά Έέ Ήή Ίί Όό Ύύ Ώώ => Αα Εε Ηη Ιι Οο Υυ Ωω
Strip NonspacingMark - Άά Έέ Ήή Ίί Όό Ύύ Ώώ => I▒I▒ I▒I▒ I▒I▒ I▒I I▒I▒ I▒I▒ I▒I▒
Text::Unaccent - مُدَرِّسَة => مُدَرِّسَة
Strip NonspacingMark - مُدَرِّسَة => U▒U▒▒ U▒رU▒U▒▒3U▒ة

At a glance, Text::Unaccent looks like it works for French, German, and Greek... but it doesn't touch Hebrew, Chinese, or Arabic.
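For anyone who wants to check the double-encoding theory at the byte level, here is a minimal sketch using the same iconv round trip as above. It assumes a POSIX shell with printf, iconv, od and tr available; the helper names (hex, double_encode, undo_double_encode) are made up for this illustration and are not part of Galen's script or Text::Unaccent. It feeds in a single character, U+0645 ARABIC LETTER MEEM (the first letter of the Arabic test word), as raw bytes so the result doesn't depend on the terminal's encoding.

```shell
#!/bin/sh
# Illustrative helpers only; the names are hypothetical.

# Print stdin as a compact lowercase hex string.
hex() { od -An -tx1 | tr -d ' \n'; }

# Misread UTF-8 bytes as Latin-1 and re-encode them as UTF-8.
# This is the suspected double-encoding step.
double_encode() { iconv -f latin1 -t utf-8; }

# Undo it: decode the UTF-8 and write each character back as one Latin-1 byte.
undo_double_encode() { iconv -f utf-8 -t latin1; }

# U+0645 ARABIC LETTER MEEM is the two bytes D9 85 in UTF-8 (octal 331 205).
meem=$(printf '\331\205' | hex)
doubled=$(printf '\331\205' | double_encode | hex)
restored=$(printf '\331\205' | double_encode | undo_double_encode | hex)

echo "original: $meem"
echo "doubled:  $doubled"
echo "restored: $restored"
```

This should print d985, then c399c285, then d985 again. The doubled form is U+00D9 (Ù) followed by the control character U+0085, which a terminal typically can't display. That matches the "Ù" plus replacement character pairs in the garbled Text::Unaccent output.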
Here's that output again with the debugging:

unac.c:13708: unac_data3[10] & unac_positions[3][10]: 0x00e9 => 0x0065
unac.c:13708: unac_data0[20] & unac_positions[0][21]: 0x0074 => untouched
unac.c:13708: unac_data3[10] & unac_positions[3][10]: 0x00e9 => 0x0065
Text::Unaccent - été => ete
Strip NonspacingMark - été => A▒tA▒

unac.c:13708: unac_data0[21] & unac_positions[0][22]: 0x0075 => untouched
unac.c:13708: unac_data0[13] & unac_positions[0][14]: 0x006d => untouched
unac.c:13708: unac_data0[12] & unac_positions[0][13]: 0x006c => untouched
unac.c:13708: unac_data0[1] & unac_positions[0][2]: 0x0061 => untouched
unac.c:13708: unac_data3[29] & unac_positions[3][29]: 0x00fc => 0x0075
unac.c:13708: unac_data0[20] & unac_positions[0][21]: 0x0074 => untouched
Text::Unaccent - umlaüt => umlaut
Wide character in print at unaccent.pl line 47.
Strip NonspacingMark - umlaüt => umlaA1⁄4t

unac.c:13708: unac_data0[2] & unac_positions[0][3]: 0x05e2 => untouched
unac.c:13708: unac_data0[17] & unac_positions[0][18]: 0x05d1 => untouched
unac.c:13708: unac_data0[8] & unac_positions[0][9]: 0x05e8 => untouched
unac.c:13708: unac_data0[25] & unac_positions[0][26]: 0x05d9 => untouched
unac.c:13708: unac_data0[10] & unac_positions[0][11]: 0x05ea => untouched
Text::Unaccent - עברית => עברית
Strip NonspacingMark - עברית => עב▒ י▒a

unac.c:13708: unac_data0[23] & unac_positions[0][24]: 0x05d7 => untouched
unac.c:13708: unac_data0[21] & unac_positions[0][22]: 0x05d5 => untouched
unac.c:13708: unac_data0[25] & unac_positions[0][26]: 0x05b9 => untouched
unac.c:13708: unac_data0[28] & unac_positions[0][29]: 0x05dc => untouched
unac.c:13708: unac_data0[24] & unac_positions[0][25]: 0x05b8 => untouched
unac.c:13708: unac_data0[29] & unac_positions[0][30]: 0x05dd => untouched
Text::Unaccent - חוֹלָם => חוֹלָם
Strip NonspacingMark - חוֹלָם => חוO1לO ם

unac.c:13708: unac_data0[23] & unac_positions[0][24]: 0x5317 => untouched
unac.c:13708: unac_data0[12] & unac_positions[0][13]: 0x4eac => untouched
unac.c:13708: unac_data0[2] & unac_positions[0][3]: 0x5e02 => untouched
Text::Unaccent - 北京市 => 北京市
Strip NonspacingMark - 北京市 => a▒▒ao▒a ▒

unac.c:13708: unac_data21[6] & unac_positions[21][7]: 0x0386 => 0x0391
unac.c:13708: unac_data22[12] & unac_positions[22][13]: 0x03ac => 0x03b1
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[8] & unac_positions[21][9]: 0x0388 => 0x0395
unac.c:13708: unac_data22[13] & unac_positions[22][14]: 0x03ad => 0x03b5
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[9] & unac_positions[21][10]: 0x0389 => 0x0397
unac.c:13708: unac_data22[14] & unac_positions[22][15]: 0x03ae => 0x03b7
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[10] & unac_positions[21][11]: 0x038a => 0x0399
unac.c:13708: unac_data22[15] & unac_positions[22][16]: 0x03af => 0x03b9
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[12] & unac_positions[21][13]: 0x038c => 0x039f
unac.c:13708: unac_data23[12] & unac_positions[23][13]: 0x03cc => 0x03bf
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[14] & unac_positions[21][15]: 0x038e => 0x03a5
unac.c:13708: unac_data23[13] & unac_positions[23][14]: 0x03cd => 0x03c5
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[15] & unac_positions[21][16]: 0x038f => 0x03a9
unac.c:13708: unac_data23[14] & unac_positions[23][15]: 0x03ce => 0x03c9
Text::Unaccent - Άά Έέ Ήή Ίί Όό Ύύ Ώώ => Αα Εε Ηη Ιι Οο Υυ Ωω
Strip NonspacingMark - Άά Έέ Ήή Ίί Όό Ύύ Ώώ => I▒I▒ I▒I▒ I▒I▒ I▒I I▒I▒ I▒I▒ I▒I▒

unac.c:13708: unac_data0[5] & unac_positions[0][6]: 0x0645 => untouched
unac.c:13708: unac_data0[15] & unac_positions[0][16]: 0x064f => untouched
unac.c:13708: unac_data34[15] & unac_positions[34][16]: 0x062f => untouched
unac.c:13708: unac_data0[14] & unac_positions[0][15]: 0x064e => untouched
unac.c:13708: unac_data34[17] & unac_positions[34][18]: 0x0631 => untouched
unac.c:13708: unac_data0[16] & unac_positions[0][17]: 0x0650 => untouched
unac.c:13708: unac_data0[17] & unac_positions[0][18]: 0x0651 => untouched
unac.c:13708: unac_data34[19] & unac_positions[34][20]: 0x0633 => untouched
unac.c:13708: unac_data0[14] & unac_positions[0][15]: 0x064e => untouched
unac.c:13708: unac_data34[9] & unac_positions[34][10]: 0x0629 => untouched
Text::Unaccent - مُدَرِّسَة => مُدَرِّسَة
Strip NonspacingMark - مُدَرِّسَة => U▒U▒▒ U▒رU▒U▒▒3U▒ة