On Thursday, Jul 24, 2003, at 11:54 Asia/Tokyo, Dan Kogai wrote:
> On Wednesday, Jul 23, 2003, at 23:20 Asia/Tokyo, Kino wrote:
>> 3. Terminal showed
>>
>> MacArabic "\x20" does not map to Unicode.
>> MacArabic "\x21" does not map to Unicode.
>> MacArabic "\x22" does not map to Unicode.
>
> These are easy to fix; just copy the missing parts from ASCII.
>
>> 4. In result.txt, the following characters, i.e. \x20-\x2F, \x3A-\x3F, \x5B-\x5F, \x7B-\x7D, \x81, \x8C, \x93, \x98, \x9B, \xA0-\xA4, \xA6-\xAB, \xAD-\xBA, \xBC-\xBE, \xC0, \xDB-\xDF, \xFB-\xFD, have not been converted to the appropriate characters but written out in hexadecimal notation instead.
>
> I am not sure if I can fix these by simply reapplying ARABIC.TXT because of BIDI issues.

I think you are referring to double-defined characters such as:

0x20 <LR>+0x0020 # SPACE, left-right
0xA0 <RL>+0x0020 # SPACE, right-left
0x21 <LR>+0x0021 # EXCLAMATION MARK, left-right
0xA1 <RL>+0x0021 # EXCLAMATION MARK, right-left
0x22 <LR>+0x0022 # QUOTATION MARK, left-right
0xA2 <RL>+0x0022 # QUOTATION MARK, right-left
0x23 <LR>+0x0023 # NUMBER SIGN, left-right
0xA3 <RL>+0x0023 # NUMBER SIGN, right-left
0x24 <LR>+0x0024 # DOLLAR SIGN, left-right
0xA4 <RL>+0x0024 # DOLLAR SIGN, right-left

A solution would be to enclose each sequence of left-to-right characters between U+202D (LEFT-TO-RIGHT OVERRIDE) -- or U+200E (LEFT-TO-RIGHT MARK)? -- and U+202C (POP DIRECTIONAL FORMATTING). But obviously this works in one direction only, i.e. in conversion *from* MacArabic. In conversion from, for example, Windows Arabic CP 1256 *into* MacArabic, some kind of contextual analysis would be necessary to determine which character to use, e.g. 0x20 or 0xA0.
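
A minimal sketch of that one-way wrapping, written in Python rather than Perl only to keep it self-contained. The table covers just the five pairs quoted above, the function name is made up, and using U+202D rather than U+200E is an assumption, not a recommendation:

```python
LRO = "\u202d"  # LEFT-TO-RIGHT OVERRIDE
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING

# Excerpt only: the five double-defined pairs, byte -> (direction, character).
TABLE = {}
for offset, ch in enumerate(' !"#$'):
    TABLE[0x20 + offset] = ("LR", ch)
    TABLE[0xA0 + offset] = ("RL", ch)


def decode_with_overrides(data: bytes) -> str:
    """Decode MacArabic bytes, wrapping each left-right run in LRO ... PDF."""
    out = []
    in_ltr = False
    for b in data:
        direction, ch = TABLE[b]  # KeyError for bytes outside the excerpt
        if direction == "LR" and not in_ltr:
            out.append(LRO)       # a left-right run starts here
            in_ltr = True
        elif direction == "RL" and in_ltr:
            out.append(PDF)       # the left-right run ended before this byte
            in_ltr = False
        out.append(ch)
    if in_ltr:
        out.append(PDF)           # close a run that reaches the end of input
    return "".join(out)
```

With this, the bytes A1 21 22 A3 come out as "!", then LRO, then '!"', then PDF, then "#". Text arriving from CP 1256 carries no such markers, which is why the reverse direction cannot be done by table lookup alone.
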
It would be great if such intelligent conversion were implemented. I'm afraid, however, that this would be very difficult; I have never seen a converter that does it.
Anyway, practically speaking, I think it would be sufficient to treat all double-defined characters as right-to-left characters, because:
1. I have seldom seen left-to-right characters in DOS/Windows Arabic text files, except those used for abbreviations, tags, etc. For truly multilingual documents, alas, the Word doc format is predominant. But I may be wrong. I'm not familiar with contemporary Arabic.
2. Those who attempt to process Arabic/Farsi/Hebrew files with Perl are supposed to be aware of what they are doing ;-) Though it would be difficult to implement intelligent conversion that works for any document, it would be rather easy to write a post-processing script for a specific file, I think.
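
To illustrate both points roughly, here is a Python sketch (again only for self-containment, covering only the five pairs listed earlier; both function names and the `prefer_rtl` flag are hypothetical). It always picks the right-to-left byte when encoding, and a second helper stands in for a file-specific post-processing step that switches a known byte range back to left-to-right:

```python
# Excerpt only: each double-defined character mapped to its
# (left-right byte, right-left byte) pair in MacArabic.
DOUBLE_DEFINED = {ch: (0x20 + i, 0xA0 + i) for i, ch in enumerate(' !"#$')}


def encode_double_defined(text: str, prefer_rtl: bool = True) -> bytes:
    """Encode characters from the excerpt, always choosing one direction."""
    out = bytearray()
    for ch in text:
        lr_byte, rl_byte = DOUBLE_DEFINED[ch]  # KeyError outside the excerpt
        out.append(rl_byte if prefer_rtl else lr_byte)
    return bytes(out)


def force_ltr(data: bytes, start: int, end: int) -> bytes:
    """File-specific fix-up: switch right-left bytes in data[start:end] to left-right."""
    rl_to_lr = {rl: lr for lr, rl in DOUBLE_DEFINED.values()}
    fixed = bytes(rl_to_lr.get(b, b) for b in data[start:end])
    return data[:start] + fixed + data[end:]
```

For example, `encode_double_defined('# $')` gives the bytes A3 A0 A4, and `force_ltr(..., 0, 1)` on that result turns only the first byte back into 23. Where the left-to-right ranges are (tags, abbreviations) is something only the person who knows the file can say.
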

> Anyway, I will make mac(Arabic|Farsi|Hebrew).ucm available BEFORE releasing the next version of Encode, so I would appreciate it if you tested them.

I'll be very happy to test them.
Thanks again.
Kino