Thank you for the reply.

On Thursday, Jul 24, 2003, at 11:54 Asia/Tokyo, Dan Kogai wrote:

On Wednesday, Jul 23, 2003, at 23:20 Asia/Tokyo, Kino wrote:

3. Terminal showed

MacArabic "\x20" does not map to Unicode.
MacArabic "\x21" does not map to Unicode.
MacArabic "\x22" does not map to Unicode.

These are easy to fix; just copy the missing parts from ASCII.


4. In result.txt, those characters, i.e. \x20-\x2F, \x3A-\x3F, \x5B-\x5F, \x7B-\x7D, \x81, \x8C, \x93, \x98, \x9B, \xA0-\xA4, \xA6-\xAB, \xAD-\xBA, \xBC-\xBE, \xC0, \xDB-\xDF, \xFB-\xFD have not been converted to appropriate characters but changed to hexadecimal notation.

I am not sure if I can fix these by simply reapplying ARABIC.TXT because of BIDI issues.


I think you are thinking of double-defined characters like

0x20    <LR>+0x0020       # SPACE, left-right
0xA0    <RL>+0x0020       # SPACE, right-left

0x21    <LR>+0x0021       # EXCLAMATION MARK, left-right
0xA1    <RL>+0x0021       # EXCLAMATION MARK, right-left

0x22    <LR>+0x0022       # QUOTATION MARK, left-right
0xA2    <RL>+0x0022       # QUOTATION MARK, right-left

0x23    <LR>+0x0023       # NUMBER SIGN, left-right
0xA3    <RL>+0x0023       # NUMBER SIGN, right-left

0x24    <LR>+0x0024       # DOLLAR SIGN, left-right
0xA4    <RL>+0x0024       # DOLLAR SIGN, right-left
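To make clear what I mean by "double-defined", here is a rough sketch (in Python only for brevity, not a proposal for Encode itself) that groups lines in the format quoted above by Unicode code point and reports those reachable from both an <LR> byte and an <RL> byte. The sample data is just a few of the lines above; the names SAMPLE, LINE and double_defined are my own, and the real ARABIC.TXT contains other line shapes that this pattern simply skips.

```python
import re
from collections import defaultdict

# A few lines copied from the excerpt above; a real run would read ARABIC.TXT.
SAMPLE = """\
0x20    <LR>+0x0020       # SPACE, left-right
0xA0    <RL>+0x0020       # SPACE, right-left
0x21    <LR>+0x0021       # EXCLAMATION MARK, left-right
0xA1    <RL>+0x0021       # EXCLAMATION MARK, right-left
"""

# Matches only lines of the form: 0xNN <LR|RL>+0xNNNN
LINE = re.compile(r"^0x([0-9A-Fa-f]{2})\s+<(LR|RL)>\+0x([0-9A-Fa-f]{4})")

def double_defined(text):
    """Return {code point: {"LR": byte, "RL": byte}} for doubly mapped characters."""
    seen = defaultdict(dict)
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            byte = int(m.group(1), 16)
            direction = m.group(2)
            code_point = int(m.group(3), 16)
            seen[code_point][direction] = byte
    # Keep only code points that have both an LR and an RL byte.
    return {cp: d for cp, d in seen.items() if len(d) == 2}

if __name__ == "__main__":
    for cp, d in sorted(double_defined(SAMPLE).items()):
        print("U+%04X  LR=0x%02X  RL=0x%02X" % (cp, d["LR"], d["RL"]))
```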

A solution would be to enclose each sequence of left-to-right characters between U+202D (LEFT-TO-RIGHT OVERRIDE) -- or U+200E (LEFT-TO-RIGHT MARK)? -- and U+202C (POP DIRECTIONAL FORMATTING). But it is obvious that this will work in one direction only, i.e. conversion *from* MacArabic. In conversion from, for example, Windows Arabic CP 1256 into MacArabic, some kind of contextual analysis would be necessary to determine which character to use, e.g. 0x20 or 0xA0.
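A minimal sketch of the one-way idea, again in Python and only as an illustration: it decodes bytes with a toy table covering just the five pairs listed above and wraps each run of <LR> bytes in U+202D ... U+202C. The table names and decode_with_overrides are hypothetical; a real converter would need the full MacArabic table and would also have to decide what to do with ordinary Latin letters.

```python
LRO = "\u202D"  # LEFT-TO-RIGHT OVERRIDE
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING

# Toy tables: only the five double-defined pairs quoted above (0x20-0x24, 0xA0-0xA4).
LR_BYTES = {0x20 + i: chr(0x20 + i) for i in range(5)}
RL_BYTES = {0xA0 + i: chr(0x20 + i) for i in range(5)}

def decode_with_overrides(data):
    """Decode toy MacArabic bytes, bracketing each left-to-right run with LRO/PDF."""
    out = []
    in_lr_run = False
    for b in data:
        if b in LR_BYTES:
            if not in_lr_run:
                out.append(LRO)  # open an override at the start of an LR run
                in_lr_run = True
            out.append(LR_BYTES[b])
        elif b in RL_BYTES:
            if in_lr_run:
                out.append(PDF)  # close the override before an RL character
                in_lr_run = False
            out.append(RL_BYTES[b])
        else:
            raise ValueError("byte 0x%02X is outside the toy table" % b)
    if in_lr_run:
        out.append(PDF)  # close a run that reaches the end of the input
    return "".join(out)
```

Note that both 0x20 and 0xA0 end up as U+0020; only the surrounding override characters record which byte was used, which is why the reverse conversion cannot be done by table lookup alone.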

It would be great if such intelligent conversion were implemented. I'm afraid, however, that this would be very difficult, since I have never seen such a converter.

Anyway, practically speaking, I think it would be sufficient to treat all double-defined characters as right-to-left characters because...

1. I have seldom seen left-to-right characters in DOS/Windows Arabic text files, except in abbreviations, tags, etc. In truly multilingual documents, alas, the Word doc format is predominant. But I may be wrong. I'm not familiar with contemporary Arabic.

2. Those who attempt to treat Arabic/Farsi/Hebrew files with perl are supposed to be aware of what they are doing ;-) Though it would be difficult to implement an intelligent conversion that works for any document, it would be rather easy to create a post-processing script for a specific file, I think.
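The "treat everything as right-to-left, then fix up by hand" approach could be sketched as follows. This is Python for illustration only, using a toy table with the five pairs quoted earlier; encode_all_rtl and force_ltr are names I made up, and the byte offsets passed to force_ltr stand in for whatever file-specific knowledge a post-processing script would have.

```python
# Toy table: the five double-defined characters, always sent to their <RL> bytes.
RL_BYTE = {chr(0x20 + i): 0xA0 + i for i in range(5)}
# Reverse fix-up: <RL> byte -> <LR> byte for the same five pairs.
RL_TO_LR = {0xA0 + i: 0x20 + i for i in range(5)}

def encode_all_rtl(text):
    """Encode text, choosing the right-to-left byte for every double-defined character."""
    try:
        return bytes(RL_BYTE[ch] for ch in text)
    except KeyError as exc:
        raise ValueError("character %r is outside the toy table" % exc.args[0])

def force_ltr(data, start, end):
    """Post-process: switch double-defined bytes in data[start:end] to their <LR> form."""
    fixed = bytearray(data)
    for i in range(start, min(end, len(fixed))):
        fixed[i] = RL_TO_LR.get(fixed[i], fixed[i])  # other bytes are left alone
    return bytes(fixed)
```

For example, encode_all_rtl(' !"') gives the bytes 0xA0 0xA1 0xA2, and force_ltr on positions 1 to 3 of that result turns the last two into 0x21 0x22 while leaving the first byte as 0xA0.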

Anyway, I will make mac(Arabic|Farsi|Hebrew).ucm available BEFORE releasing the next version of Encode, so I would appreciate it if you tested them.

I'll be very happy to test them.



Thanks again.



Kino



