--- Comment #10 from David Cook <> ---
So I think I've tracked down the C code behind Text::Unaccent:

The only reference I see to "damma" is in the U+FE70...U+FEFF code point range
(Arabic Presentation Forms-B), which appears to list isolated forms. That is
not what we're dealing with in these examples, where the damma is the
combining mark U+064F.

While I haven't reviewed the code extensively, it looks like the tables used
for Text::Unaccent are lacking...

If you replace the following line in Galen's script:

use Text::Unaccent qw//;

with:

use Text::Unaccent qw/unac_debug/;

You'll get more details of how Text::Unaccent is working (or not working).

Here's the output I get for the Arabic:

unac.c:13708: unac_data0[5] & unac_positions[0][6]: 0x0645 => untouched
unac.c:13708: unac_data0[15] & unac_positions[0][16]: 0x064f => untouched
unac.c:13708: unac_data34[15] & unac_positions[34][16]: 0x062f => untouched
unac.c:13708: unac_data0[14] & unac_positions[0][15]: 0x064e => untouched
unac.c:13708: unac_data34[17] & unac_positions[34][18]: 0x0631 => untouched
unac.c:13708: unac_data0[16] & unac_positions[0][17]: 0x0650 => untouched
unac.c:13708: unac_data0[17] & unac_positions[0][18]: 0x0651 => untouched
unac.c:13708: unac_data34[19] & unac_positions[34][20]: 0x0633 => untouched
unac.c:13708: unac_data0[14] & unac_positions[0][15]: 0x064e => untouched
unac.c:13708: unac_data34[9] & unac_positions[34][10]: 0x0629 => untouched
Text::Unaccent           - مُدَرِّسَة => Ù�Ù�دÙ�رÙ�Ù�سÙ�Ø©
Strip NonspacingMark     - مُدَرِّسَة => مدرسة

Here's the output I get for the Greek:
unac.c:13708: unac_data21[6] & unac_positions[21][7]: 0x0386 => 0x0391
unac.c:13708: unac_data22[12] & unac_positions[22][13]: 0x03ac => 0x03b1
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[8] & unac_positions[21][9]: 0x0388 => 0x0395
unac.c:13708: unac_data22[13] & unac_positions[22][14]: 0x03ad => 0x03b5
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[9] & unac_positions[21][10]: 0x0389 => 0x0397
unac.c:13708: unac_data22[14] & unac_positions[22][15]: 0x03ae => 0x03b7
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[10] & unac_positions[21][11]: 0x038a => 0x0399
unac.c:13708: unac_data22[15] & unac_positions[22][16]: 0x03af => 0x03b9
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[12] & unac_positions[21][13]: 0x038c => 0x039f
unac.c:13708: unac_data23[12] & unac_positions[23][13]: 0x03cc => 0x03bf
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[14] & unac_positions[21][15]: 0x038e => 0x03a5
unac.c:13708: unac_data23[13] & unac_positions[23][14]: 0x03cd => 0x03c5
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[15] & unac_positions[21][16]: 0x038f => 0x03a9
unac.c:13708: unac_data23[14] & unac_positions[23][15]: 0x03ce => 0x03c9
Text::Unaccent           - Άά Έέ Ήή Ίί Όό Ύύ Ώώ => Î�α Î�ε Î�η Î�ι Î�ο
Υ� Ω�
Strip NonspacingMark     - Άά Έέ Ήή Ίί Όό Ύύ Ώώ => Αα Εε Ηη Ιι Οο Υυ Ωω

Interestingly, we can see a reference to the Greek capital letter alpha with
the tonos diacritic:


Indeed, in the output, we can see that 0x0386 was changed to 0x0391... although
admittedly I don't know exactly how. It looks like a binary operation that uses
a bitmask to produce a certain value... we don't need to know 100% how that
mechanism is working right now... just that it works as described above.
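
For what it's worth, the indices in the debug lines are consistent with a table
lookup where the low five bits of the code point select a slot inside a block.
The following is only a sketch: the 0x1F mask is inferred from the debug output
above, not read out of unac.c, and slot_for is a hypothetical helper name.

```perl
use strict;
use warnings;

# Assumption: the low 5 bits of the code point pick the slot inside a block.
# This is inferred from the debug output, not taken from unac.c.
my $BLOCK_MASK = 0x1F;

# Hypothetical helper: slot of a code point within its block.
sub slot_for {
    my ($code_point) = @_;
    return $code_point & $BLOCK_MASK;
}

# Code point => second index printed for unac_positions[..][N] in the output.
my %printed_index = (
    0x0386 => 7,
    0x03ac => 13,
    0x0645 => 6,
    0x064f => 16,
);

for my $cp (sort { $a <=> $b } keys %printed_index) {
    printf "U+%04X: slot %2d, debug line printed %2d\n",
        $cp, slot_for($cp), $printed_index{$cp};
}
```

In each of these cases the index printed for unac_positions is the slot plus
one, which would fit a lookup that reads a start and an end offset for the
replacement data.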


So in the Arabic example... everything was "untouched" and yet the output is
garbled. That's certainly an encoding issue... 

Indeed, look at the following:

dcook@koha:~/experiments> echo "مُدَرِّسَة" | iconv -f latin1 -t utf-8

That is the same output as Text::Unaccent:

Text::Unaccent           - مُدَرِّسَة => Ù�Ù�دÙ�رÙ�Ù�سÙ�Ø©

So somewhere along the line that UTF-8 string is getting double-encoded.

Check this out:

dcook@koha:~/experiments> echo "مُدَرِّسَة" | iconv -f latin1 -t utf-8 | iconv -f utf-8 -t latin1
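
The same round trip can be reproduced inside Perl with the core Encode module.
This is a minimal sketch, not taken from Galen's script; the string is built
from the ten code points listed in the Arabic debug output so that it does not
depend on the encoding of the source file.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The Arabic word, spelled out from the code points in the debug output.
my $text = "\x{0645}\x{064f}\x{062f}\x{064e}\x{0631}"
         . "\x{0650}\x{0651}\x{0633}\x{064e}\x{0629}";

# Correct UTF-8 bytes for the string.
my $utf8_bytes = encode('UTF-8', $text);

# The mistake: read those bytes as Latin-1 characters and encode them again.
my $double = encode('UTF-8', decode('ISO-8859-1', $utf8_bytes));

# The repair: peel the extra layer off again, as the second iconv call does.
my $repaired = encode('ISO-8859-1', decode('UTF-8', $double));

printf "characters: %d, UTF-8 bytes: %d, double-encoded bytes: %d\n",
    length($text), length($utf8_bytes), length($double);
print "round trip restores the original bytes\n" if $repaired eq $utf8_bytes;
```

Every byte of the Arabic text is above 0x7F, so the second encoding pass turns
each byte into two, which is why the garbled output is twice as long.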

I think the double-encoding is down to us using "binmode STDOUT, ':utf8';"
(which tells Perl to output UTF-8 encoded bytes instead of Latin-1, or whatever
other single-byte encoding it normally uses) and "use utf8" (which tells Perl
that the source code uses UTF-8)...
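
To illustrate the suspected mechanism, here is a sketch that prints the same
data through a ':utf8' layer twice, once as raw UTF-8 bytes and once decoded
into characters first. It assumes Text::Unaccent hands back undecoded bytes,
which I haven't confirmed; returns_utf8_bytes is a hypothetical stand-in for
it and written_through_utf8_layer is a helper made up for this example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical stand-in for a byte-oriented routine. Assumption: the real
# module returns undecoded UTF-8 bytes (not confirmed in this thread).
sub returns_utf8_bytes {
    my ($chars) = @_;
    return encode('UTF-8', $chars);
}

# Capture exactly what a ':utf8' output layer would write for a string.
sub written_through_utf8_layer {
    my ($string) = @_;
    my $buffer = '';
    open my $fh, '>:utf8', \$buffer
        or die "cannot open in-memory handle: $!";
    print {$fh} $string;
    close $fh;
    return $buffer;
}

my $word  = "\x{00e9}t\x{00e9}";           # "été" as three characters
my $bytes = returns_utf8_bytes($word);     # five bytes: C3 A9 74 C3 A9

# Printing the bytes as-is lets the layer encode each high byte again.
my $double = written_through_utf8_layer($bytes);

# Decoding to characters first gives the layer something it encodes once.
my $single = written_through_utf8_layer(decode('UTF-8', $bytes));

printf "raw bytes printed: %d bytes out, decoded first: %d bytes out\n",
    length($double), length($single);
```

If that assumption holds, decoding the module's return value before printing
would be an alternative to removing the layer and the pragma.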

Removing those gets us the following:

Text::Unaccent           - été => ete

Strip NonspacingMark     - été => A▒tA▒
Text::Unaccent           - umlaüt => umlaut
Wide character in print at line 47.
Strip NonspacingMark     - umlaüt => umlaA1⁄4t
Text::Unaccent           - עברית => עברית

Strip NonspacingMark     - עברית => עב▒ י▒a
Text::Unaccent           - חוֹלָם => חוֹלָם
Strip NonspacingMark     - חוֹלָם => חוO1לO ם
Text::Unaccent           - 北京市 => 北京市

Strip NonspacingMark     - 北京市 => a▒▒ao▒a ▒
Text::Unaccent           - Άά Έέ Ήή Ίί Όό Ύύ Ώώ => Αα Εε Ηη Ιι Οο Υυ Ωω

Strip NonspacingMark     - Άά Έέ Ήή Ίί Όό Ύύ Ώώ => I▒I▒ I▒I▒ I▒I▒ I▒I  I▒I▒
I▒I▒ I▒I▒
Text::Unaccent           - مُدَرِّسَة => مُدَرِّسَة

Strip NonspacingMark     - مُدَرِّسَة => U▒U▒▒ U▒رU▒U▒▒3U▒ة

At a glance, Text::Unaccent looks like it works for French, German, and
Greek... but doesn't touch Hebrew, Chinese, or Arabic.
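
For comparison, here is a guess at what the "Strip NonspacingMark" approach
amounts to. Galen's script isn't reproduced in this comment, so treat this as
a sketch only: strip_nonspacing_marks is a hypothetical name, and the steps
(decompose, drop everything in \p{Mn}, recompose) are my assumption.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC);

# Hypothetical helper: decompose, drop nonspacing marks (\p{Mn}), recompose.
# Expects a decoded character string, not raw UTF-8 bytes.
sub strip_nonspacing_marks {
    my ($chars) = @_;
    (my $stripped = NFD($chars)) =~ s/\p{Mn}//g;
    return NFC($stripped);
}

binmode STDOUT, ':encoding(UTF-8)';

my @samples = (
    "\x{00e9}t\x{00e9}",                        # French, precomposed e-acute
    "\x{0386}\x{03ac}",                         # Greek alpha with tonos
    "\x{0645}\x{064f}\x{062f}\x{064e}\x{0631}"  # Arabic word from the
  . "\x{0650}\x{0651}\x{0633}\x{064e}\x{0629}", # debug output above
);

for my $sample (@samples) {
    printf "%s => %s\n", $sample, strip_nonspacing_marks($sample);
}
```

Because this relies on Unicode character properties rather than a hand-built
table, it covers the Arabic and Hebrew marks that Text::Unaccent leaves alone.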

Here's that output again with the debugging:

unac.c:13708: unac_data3[10] & unac_positions[3][10]: 0x00e9 => 0x0065
unac.c:13708: unac_data0[20] & unac_positions[0][21]: 0x0074 => untouched
unac.c:13708: unac_data3[10] & unac_positions[3][10]: 0x00e9 => 0x0065
Text::Unaccent           - été => ete

Strip NonspacingMark     - été => A▒tA▒
unac.c:13708: unac_data0[21] & unac_positions[0][22]: 0x0075 => untouched
unac.c:13708: unac_data0[13] & unac_positions[0][14]: 0x006d => untouched
unac.c:13708: unac_data0[12] & unac_positions[0][13]: 0x006c => untouched
unac.c:13708: unac_data0[1] & unac_positions[0][2]: 0x0061 => untouched
unac.c:13708: unac_data3[29] & unac_positions[3][29]: 0x00fc => 0x0075
unac.c:13708: unac_data0[20] & unac_positions[0][21]: 0x0074 => untouched
Text::Unaccent           - umlaüt => umlaut
Wide character in print at line 47.
Strip NonspacingMark     - umlaüt => umlaA1⁄4t
unac.c:13708: unac_data0[2] & unac_positions[0][3]: 0x05e2 => untouched
unac.c:13708: unac_data0[17] & unac_positions[0][18]: 0x05d1 => untouched
unac.c:13708: unac_data0[8] & unac_positions[0][9]: 0x05e8 => untouched
unac.c:13708: unac_data0[25] & unac_positions[0][26]: 0x05d9 => untouched
unac.c:13708: unac_data0[10] & unac_positions[0][11]: 0x05ea => untouched
Text::Unaccent           - עברית => עברית

Strip NonspacingMark     - עברית => עב▒ י▒a
unac.c:13708: unac_data0[23] & unac_positions[0][24]: 0x05d7 => untouched
unac.c:13708: unac_data0[21] & unac_positions[0][22]: 0x05d5 => untouched
unac.c:13708: unac_data0[25] & unac_positions[0][26]: 0x05b9 => untouched
unac.c:13708: unac_data0[28] & unac_positions[0][29]: 0x05dc => untouched
unac.c:13708: unac_data0[24] & unac_positions[0][25]: 0x05b8 => untouched
unac.c:13708: unac_data0[29] & unac_positions[0][30]: 0x05dd => untouched
Text::Unaccent           - חוֹלָם => חוֹלָם
Strip NonspacingMark     - חוֹלָם => חוO1לO ם
unac.c:13708: unac_data0[23] & unac_positions[0][24]: 0x5317 => untouched
unac.c:13708: unac_data0[12] & unac_positions[0][13]: 0x4eac => untouched
unac.c:13708: unac_data0[2] & unac_positions[0][3]: 0x5e02 => untouched
Text::Unaccent           - 北京市 => 北京市

Strip NonspacingMark     - 北京市 => a▒▒ao▒a ▒
unac.c:13708: unac_data21[6] & unac_positions[21][7]: 0x0386 => 0x0391
unac.c:13708: unac_data22[12] & unac_positions[22][13]: 0x03ac => 0x03b1
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[8] & unac_positions[21][9]: 0x0388 => 0x0395
unac.c:13708: unac_data22[13] & unac_positions[22][14]: 0x03ad => 0x03b5
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[9] & unac_positions[21][10]: 0x0389 => 0x0397
unac.c:13708: unac_data22[14] & unac_positions[22][15]: 0x03ae => 0x03b7
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[10] & unac_positions[21][11]: 0x038a => 0x0399
unac.c:13708: unac_data22[15] & unac_positions[22][16]: 0x03af => 0x03b9
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[12] & unac_positions[21][13]: 0x038c => 0x039f
unac.c:13708: unac_data23[12] & unac_positions[23][13]: 0x03cc => 0x03bf
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[14] & unac_positions[21][15]: 0x038e => 0x03a5
unac.c:13708: unac_data23[13] & unac_positions[23][14]: 0x03cd => 0x03c5
unac.c:13708: unac_data0[0] & unac_positions[0][1]: 0x0020 => untouched
unac.c:13708: unac_data21[15] & unac_positions[21][16]: 0x038f => 0x03a9
unac.c:13708: unac_data23[14] & unac_positions[23][15]: 0x03ce => 0x03c9
Text::Unaccent           - Άά Έέ Ήή Ίί Όό Ύύ Ώώ => Αα Εε Ηη Ιι Οο Υυ Ωω

Strip NonspacingMark     - Άά Έέ Ήή Ίί Όό Ύύ Ώώ => I▒I▒ I▒I▒ I▒I▒ I▒I  I▒I▒
I▒I▒ I▒I▒
unac.c:13708: unac_data0[5] & unac_positions[0][6]: 0x0645 => untouched
unac.c:13708: unac_data0[15] & unac_positions[0][16]: 0x064f => untouched
unac.c:13708: unac_data34[15] & unac_positions[34][16]: 0x062f => untouched
unac.c:13708: unac_data0[14] & unac_positions[0][15]: 0x064e => untouched
unac.c:13708: unac_data34[17] & unac_positions[34][18]: 0x0631 => untouched
unac.c:13708: unac_data0[16] & unac_positions[0][17]: 0x0650 => untouched
unac.c:13708: unac_data0[17] & unac_positions[0][18]: 0x0651 => untouched
unac.c:13708: unac_data34[19] & unac_positions[34][20]: 0x0633 => untouched
unac.c:13708: unac_data0[14] & unac_positions[0][15]: 0x064e => untouched
unac.c:13708: unac_data34[9] & unac_positions[34][10]: 0x0629 => untouched
Text::Unaccent           - مُدَرِّسَة => مُدَرِّسَة

Strip NonspacingMark     - مُدَرِّسَة => U▒U▒▒ U▒رU▒U▒▒3U▒ة
