On Tuesday, Aug 5, 2003, at 01:48 Asia/Tokyo, Dan Kogai wrote:

On Thursday, Jul 24, 2003, at 14:39 Asia/Tokyo, Kino wrote:
Anyway, I will make mac(Arabic|Farsi|Hebrew).ucm available BEFORE releasing the next version of Encode, so I would appreciate it if you tested them.

I'll be very happy to test them.

I am sorry I forgot to report this to you, but the patch is already in bleadperl (though 1.98 remains unreleased). I sent the patch to jhi because he was in a hurry (that happened right after RC4). You can get the UCMs via


http://www.dan.co.jp/~dankogai/macbidi-ucm.tar.gz

Thanks. I tested the new ucm files with Encode 1.97 recompiled with them. Since macFarsi.ucm is not in the tarball, I also tried perl 5.8.1 RC4, which seems to have a new ucm for MacFarsi.


With 'perl -Mencoding=MacArabic,STDOUT,utf8 -pe1 < /tmp/MacArabic.txt > /tmp/MA_utf8.txt' or something similar, the following characters were still not converted to UTF-8 characters but were replaced with hexadecimal notation.

MacArabic/MacFarsi

\x20-\x24, \x26-\x2B, \x2D-\x2F, \x3A, \x3C-\x3E, \x5B-\x5F, \x7B-\x7D

MacHebrew

\x20-\x25, \x27-\x2F, \x30-\x3F, \x5B, \x5D, \x7B-\x7D

These are the characters that have a right-to-left equivalent, mapped to the same Unicode code point, in \x80-\xFF.
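To make the list above easier to reproduce, a per-byte probe could be sketched as follows. This is only a sketch: the helper name undecodable_bytes is made up for illustration, and it assumes that the \xHH notation in the output comes from Encode's escape fallback for unmapped bytes, so that asking decode() to croak instead (Encode::FB_CROAK) should expose the same bytes. What it prints depends on the Encode version and the ucm files it was built with.

```perl
use strict;
use warnings;
use Encode ();

# Return the byte values in @$bytes that $encoding cannot decode.
# With Encode::FB_CROAK, decode() dies on an unmapped byte instead of
# substituting an escape, so each failure can be caught with eval.
# (An unknown encoding name also dies, so every byte is then reported.)
sub undecodable_bytes {
    my ($encoding, $bytes) = @_;
    my @failed;
    for my $byte (@$bytes) {
        my $octets = chr $byte;    # fresh copy: decode() may modify its input
        my $ok = eval {
            Encode::decode($encoding, $octets, Encode::FB_CROAK());
            1;
        };
        push @failed, $byte unless $ok;
    }
    return @failed;
}

# Byte ranges reported above for MacArabic/MacFarsi.
my @reported = (0x20 .. 0x24, 0x26 .. 0x2B, 0x2D .. 0x2F, 0x3A,
                0x3C .. 0x3E, 0x5B .. 0x5F, 0x7B .. 0x7D);

# Print each byte that still fails to decode, one per line.
printf "\\x%02X\n", $_ for undecodable_bytes('MacArabic', \@reported);
```

If nothing is printed, the installed ucm maps all of the reported bytes. The same call with 'MacHebrew' and the MacHebrew ranges checks the second list.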

I think these notations should be converted into real characters, because many text files in MacArabic/Farsi/Hebrew contain portions in English or other European languages where those left-to-right characters are used. Moreover, in some files the left-to-right characters are found even in the Arabic portions. This occurs when the file was not created on a Mac but converted from a Windows file. For example, the space character in Arabic text is not \xA0 but \x20 in

ftp://ftp.cs.tu-berlin.de/pub/mac/misc/Quran-and-trans.sea.1.bin
ftp://ftp.cs.tu-berlin.de/pub/mac/misc/Quran-and-trans.sea.2.bin
(You need both to examine them. They seem to be the two parts of a split archive.)


And with 'perl -Mencoding=utf8,STDOUT,MacArabic -pe1 < /tmp/MAcharset_utf8.txt > /tmp/MAcharset.txt' or something similar, I noticed the following oddities. I had not tried this when I posted the first message in this thread, sorry.


MacArabic

\xB0-\xB9 (ARABIC-INDIC DIGIT ZERO-NINE) are enclosed between "\x{202e}" and "\x{202c}"

MacFarsi

\xB0-\xB9 (EXTENDED ARABIC-INDIC DIGIT ZERO-NINE) are enclosed between "\x{202e}" and "\x{202c}"

MacHebrew

\xB0-\xB9 (DIGIT ZERO-NINE) are enclosed between "\x{202e}" and "\x{202c}"
"\x{05f2}" instead of \x81
"\x{f86a}" followed by \xEC\xDD instead of \xC0
\xCB followed by "\x{f87f}" instead of \xDE.


"\x{202e}" (RIGHT-TO-LEFT OVERRIDE) and "\x{202c}" (POP DIRECTIONAL FORMATTING) should be removed, since MacArabic/Farsi/Hebrew don't use directional marks.
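As a stopgap on the user side, the two marks could be stripped from the text before it is encoded. A minimal sketch, assuming the text has already been decoded into Perl characters; strip_bidi_overrides is a hypothetical helper name, not something Encode provides:

```perl
use strict;
use warnings;

# Remove U+202E (RIGHT-TO-LEFT OVERRIDE) and U+202C (POP DIRECTIONAL
# FORMATTING) from a string of decoded characters; everything else is
# left untouched.
sub strip_bidi_overrides {
    my ($text) = @_;
    $text =~ s/[\x{202C}\x{202E}]//g;
    return $text;
}

# Example: an ARABIC-INDIC DIGIT ZERO wrapped in the two marks.
my $wrapped = "\x{202E}\x{0660}\x{202C}";
my $bare    = strip_bidi_overrides($wrapped);
printf "%d character(s) left\n", length $bare;
```

This only hides the symptom for the digits. It does nothing for the MacHebrew sequences involving "\x{f86a}" and "\x{f87f}".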

As for the other oddities with Hebrew, perhaps some special treatment would be needed, since those characters are combined characters. Is this possible? I know almost nothing about how Encode works.


Kino



