RE: Difference between EM QUAD and EM SPACE
At 2:09 AM -0800 7/11/00, Roozbeh Pournader wrote:

> On Mon, 10 Jul 2000, Jonathan Coxhead wrote: In TeX, the difference is that an EM QUAD (\qquad) and an EN QUAD (\quad) provide spaces that are legitimate breakpoints for lines within a paragraph; while EM SPACE, EN SPACE (\enspace) and THIN SPACE (\thinspace) produce horizontal space that cannot cause a line-break. Very close, except for the size of the quads.
>
> I don't think so. I remember that in TeX, \quad was an em quad, and \qquad a double em quad. Would someone look at a good source for that? --roozbeh

Correct. Knuth says: "The macros \enskip, \quad, and \qquad provide spaces that are legitimate breakpoints within a paragraph; \enspace, \thinspace, and \negthinspace produce space that cannot cause a break..."

  \def\enskip{\hskip.5em\relax}
  \def\quad{\hskip1em\relax}
  \def\qquad{\hskip2em\relax}
  \def\enspace{\kern.5em }
  \def\thinspace{\kern .16667em }

(The TeXbook, p. 352.) Roughly, then:

  \enskip    ~ en space
  \quad      ~ em space
  \qquad     ~ 2em space
  \enspace   ~ en kern
  \thinspace ~ thin kern

Not the most enlightening choice of names, but we have that problem as well.

Edward Cherlin, Generalist
"A knot!" exclaimed Alice. "Oh, do let me help to undo it." --Alice in Wonderland
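The glue-versus-kern distinction above can be seen directly in plain TeX. A minimal sketch (my illustration, not from the thread; the narrow \hsize is chosen only to force line breaks):

```tex
% \quad and \qquad insert glue (\hskip), so TeX may break a line there;
% \enspace and \thinspace insert kerns (\kern), which forbid a break.
% Spaces after a control word are absorbed by the tokenizer, so the only
% break opportunities below come from the macros themselves.
\hsize=5em
AAA\quad BBB\quad CCC\quad DDD\par    % TeX can break at the quads
AAA\enspace BBB\enspace CCC\par       % no breakpoints: line overflows
\bye
```

Running this through plain TeX should show the first paragraph wrapping at the quads while the second produces an overfull line.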
Re: Han character names?
Thomas Chan wrote:

> I was interested in seeing an example of a Han graph that has no documented pronunciation, because I was under the impression that such a graph doesn't/cannot exist.
>
> The "beikao" chapter (pp. 1585-1631) of the _Kangxi Zidian_ would be one place to start for those unconfirmable graphs that have pronunciations but no meanings, or that have neither. E.g., 1585.9 (two U+4E36's, one over the other, and all that overlaid across the leftmost stroke of U+4E43) and 1593.23 (U+5B80 above U+4E43), both of which have no pronunciation/meaning information documented in Morohashi or _Hanyu Da Zidian_ either.

Those with the Kodansha reprint of the Biaozhu Dingzheng Kangxi Zidian (ISBN 4-06-121033-5) will find the 'beikao' chapter on pages 3533-3602, the last chapter in the book. The first example above is the 9th character in the chapter. The second example is the first character under U+5B80 as a classifier (i.e., Kangxi classifier #40, 'sheltered, under a roof, thatch'), p. 3545.

Thanks for calling attention to the 'beikao' chapter of Kangxi Zidian. Very instructive. I noticed, in the first 600 or 700 entries listed in this chapter, that these two characters are the only examples where both the pronunciation and meaning are said to be totally missing. (Both characters seem to have been created at the same time, AD 841-846.) But within this same range of entries, there are at least four more cases where both the pronunciation and meaning are said to be 'not yet clear'. If this rate of discovery holds throughout the chapter (there seem to be about 4000-5000 entries in all), we would expect to find around 40 characters that either totally lacked documented pronunciation or for which the pronunciation was not clear or was questionable at the time of compilation. Certainly proves that such critters exist.
It is also interesting that all the other entries that are not listed as mistakes for proper characters or as allographs, and which must number a couple thousand or more, are listed as having pronunciation but no known meaning. Oh, how rudely out of character for a script persistently characterized as "ideographic"... Jon -- Jon Babcock [EMAIL PROTECTED]
Re: Detecting installed fonts in a browser window [was Re...
Subject: Re: Detecting installed fonts in a browser window [was Re: Tradi

> Due to a bug in Arabic-enabled fonts distributed with IE 5 (Tahoma, Arabic Traditional, Courier New, etc.), the medial form of U+06CC (ARABIC LETTER FARSI YEH) gets rendered exactly like the isolated form.

Some comments: Looks to me like the initial form of U+06CC is also wrong. You didn't mention that, but I hope your "font fixer" tool will correct it also. Even Times New Roman and Arial Unicode have this flaw.

> The bug is only a forgotten field in the GPOS table of the fonts

Isn't it actually in the GSUB table? I don't think these fonts have GPOS information.

Bob Hallissy
Persian developers (was Re: Detecting installed fonts in ...
Subject: Persian developers (was Re: Detecting installed fonts in a browser window)

[EMAIL PROTECTED] said:

> That has created a major problem for Persian developers trying to maintain a web page. They should check the page for any case of the medial form of ARABIC LETTER FARSI YEH, and replace it with ARABIC LETTER YEH, because they look like each other in the medial form. But that also creates a problem when the user uses a local search on the document.

This raises a question that I've been wondering about: It has been my impression that many Persian applications use the Arabic YEH code point (Windows character 237, U+064A) for the Farsi Yeh, and then depend on the font to have been modified to show the final and isolated forms without dots. This, of course, would not be considered "correct Unicode", but it was a way to adapt Arabic software to Farsi needs. Similar hacks, if I may call them that, are typically made with a couple of other characters, namely Teh Marbuta (Windows 201, U+0629) and Kaf (Windows 223, U+0643), to get the correct Farsi shapes.

With wider Unicode coverage from Microsoft and other vendors (albeit with occasional bugs, as you have pointed out), these hacks are no longer necessary. But there is surely a large body of Farsi text already encoded using the hacks. What is the general mood of the Persian software industry towards this problem: Are they moving rapidly to Unicode, or are they staying with the old? Is a standard mechanism (e.g., import/export filters) being developed for migrating and exchanging the data? I'd appreciate any insight you or others on this list have.

Bob Hallissy
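The codepage facts behind the hack described above can be checked mechanically. A minimal sketch in Python (my illustration, not part of the thread; it relies on Python's cp1256 codec, which is built from the unicode.org CP1256.TXT table):

```python
# Windows character 237 (0xED) in cp1256 is U+064A ARABIC LETTER YEH,
# the code point the "hack" reuses for Farsi Yeh. U+06CC ARABIC LETTER
# FARSI YEH has no cp1256 encoding at all, which is why modified fonts
# (YEH glyphs with the dots dropped in final/isolated forms) were used.
yeh = "\u064A"        # ARABIC LETTER YEH
farsi_yeh = "\u06CC"  # ARABIC LETTER FARSI YEH

assert yeh.encode("cp1256") == b"\xed"    # Windows character 237
assert b"\xed".decode("cp1256") == yeh    # round-trips cleanly

try:
    farsi_yeh.encode("cp1256")
    farsi_yeh_in_cp1256 = True
except UnicodeEncodeError:
    farsi_yeh_in_cp1256 = False

assert farsi_yeh_in_cp1256 is False       # FARSI YEH is absent from cp1256
```

This matches Roozbeh's complaint later in the thread that the CP1256 table on the Unicode site lacks U+06CC.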
Re: Euro character in ISO
On Tue, 11 Jul 2000, Asmus Freytag wrote:

> The only safe way to encode a Euro in HTML appears to be to use Unicode - e.g. by using 8859-1 together with the numeric character reference (NCR) &#x20AC;

&euro; is much safer. Netscape 4 doesn't recognize hexadecimal character references.

--roozbeh
Re: Detecting installed fonts in a browser window [was Re...
On Wed, 12 Jul 2000, Bob Hallissy wrote:

> Looks to me like the initial form of U+06CC is also wrong. You didn't mention that, but I hope your "font fixer" tool will correct it also.

You're right. It will also do that. I had forgotten that.

> Isn't it actually in the GSUB table? I don't think these fonts have GPOS information

Sorry, I meant to write GSUB.
Re: Persian developers (was Re: Detecting installed fonts in ...
On Wed, 12 Jul 2000, Bob Hallissy wrote:

> It has been my impression that many Persian applications use the Arabic YEH code point (Windows character 237, U+064A) for the Farsi Yeh, and then depend on the font to have been modified to show the final and isolated forms without dots. This, of course, would not be considered "correct Unicode", but it was a way to adapt Arabic software to Farsi needs. Similar hacks, if I may call them that, are typically made with a couple of other characters, namely Teh Marbuta (Windows 201, U+0629) and Kaf (Windows 223, U+0643), to get the correct Farsi shapes.

I've not heard anything about the Teh Marbuta in this regard. But I know about the YEH and KAF used instead of FARSI YEH and KEHEH. The problem with YEH is still there when someone uses CP1256, since that does not have the FARSI YEH.

> With wider Unicode coverage from Microsoft and other vendors (albeit with occasional bugs as you have pointed out), these hacks are no longer necessary. But there is surely a large body of Farsi text already encoded using the hacks. What is the general mood of the Persian software industry towards this problem: Are they moving rapidly to Unicode or are they staying with the old? Is a standard mechanism (e.g., import/export filters) being developed for migrating and exchanging the data?

The volume seems to be Word documents only. Many people are writing converters to make these OK. We are also among the converter writers. Also, a few are moving rapidly to Unicode. The Word users want their WYSIWYG; they only want to edit and print their old docs. So they install the old fonts on their newer OSes, and things go OK for them.
Euro symbol in HTML (was: Euro character in ISO)
On 2000-07-11 at 23:30 UTC, Asmus Freytag wrote:

> The only safe way to encode a Euro in HTML appears to be to use Unicode - e.g. by using 8859-1 together with the numeric character reference (NCR) &#x20AC;

This does, however, not work with Netscape 4.x, as these browsers only understand decimal NCRs. Pre-4.7 Netscape browsers do not correctly interpret NCRs above 255 if an 8-bit encoding (e.g., Latin-1) is used, in blatant contrast to the standard, cf. http://www.w3.org/TR/REC-html40/charset.html#h-5.1 (I do not remember the exact version in which this bug was fixed). Hence, the only safe way to encode the Euro symbol seems to be:

- Use the &euro; entity, cf. the last line of http://www.w3.org/TR/REC-html40/sgml/entities.html#h-24.4.1. This will cause Netscape 4.7 to display "EUR" if the Euro glyph is not available (at least the version on my Unix box does so).

The following two ways are safe if the Euro glyph is available in the fonts specified by the user:

- use UTF-8 together with the decimal NCR "&#8364;";
- use UTF-8 together with the UTF-8 encoding 'E2 82 AC' (in hex).

In all cases, do not forget to declare your HTML source as either HTML 4.0 or HTML 4.01, cf. http://www.w3.org/TR/REC-html40/struct/global.html#h-7.2.

Cf. my examples:
- http://www.rz.uni-konstanz.de/y2k/test/Euro-Latin-1.htm in Latin-1,
- http://www.rz.uni-konstanz.de/y2k/test/Euro-Latin-9.htm in Latin-9,
- http://www.rz.uni-konstanz.de/y2k/test/Euro-UTF.htm in UTF-8.

Best wishes, Otto Stolz
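The three numeric forms Otto lists all denote the same character, which is easy to confirm. A quick sketch in Python (my illustration, not part of Otto's message):

```python
# U+20AC EURO SIGN: the hex NCR &#x20AC; and the decimal NCR &#8364;
# are the same number in two bases, and its UTF-8 form is E2 82 AC.
euro = "\u20AC"

assert ord(euro) == 0x20AC == 8364              # &#x20AC; equals &#8364;
assert euro.encode("utf-8") == b"\xe2\x82\xac"  # the 'E2 82 AC' Otto cites
```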
Re: Euro character in ISO
At 15:30 -0800 2000-07-11, Asmus Freytag wrote:

> At 01:25 PM 7/11/00 -0800, Leon Spencer wrote: Has ISO addressed the Euro character?
>
> Yes. It's at 0x20AC in ISO/IEC 10646-1.

This is not a standard notation. Please use U+20AC or one of the other standard notations to refer to UCS code positions.

ME
Re: Han character names?
[EMAIL PROTECTED] wrote:

> If they did, would the SIP overflow? But that is what Plane 3 is for. MDIP ("More damn ideographs plane")??

Yes, let us call it the MDIP. What would that be in French? "Prise de tête" (roughly, "a headache"). (That is probably too much Parisian French to get wide acceptance in Canada. But I believe people in France will get it correctly.)

Antoine
Re: Euro character in ISO
At 18:19 -0800 2000-07-11, Robert A. Rosenberg wrote:

> The problem would go away if the ISO would get their heads out of their a$$ and drop the C1 junk from the NEW 'TOUCHED UP' 8859s and put the CP125x codes there.

Excuse me, but that is not appropriate. The ISO/IEC 8859 series is conformant with ISO/IEC 2022, and protocols which adhere to that standard should not be compromised by what you suggest.

> Then when you said you used 8859-21 you'd get CP-1252 and Windows would no longer need to lie (or tell the truth by admitting it is CP-1252).

The problem is that some companies do/did not correctly identify their code pages. The world can live with Latin-1 and CP-1252. It shouldn't have to live with CP-1252 being identified as Latin-1.

Michael Everson ** Everson Gunn Teoranta ** http://www.egt.ie
15 Port Chaeimhghein Íochtarach; Baile Átha Cliath 2; Éire/Ireland
Vox +353 1 478 2597 ** Fax +353 1 478 2597 ** Mob +353 86 807 9169
27 Páirc an Fhéithlinn; Baile an Bhóthair; Co. Átha Cliath; Éire
Re: Han character names?
At 10:23 -0800 2000-07-11, Jon Babcock wrote:

> But covering the jiaguwen [J. koukotsumoji] (oracle bone script) is another story. First of all, it's a moving target.

Isn't it best treated as a font variant of CJK?

Michael Everson ** Everson Gunn Teoranta ** http://www.egt.ie
15 Port Chaeimhghein Íochtarach; Baile Átha Cliath 2; Éire/Ireland
Vox +353 1 478 2597 ** Fax +353 1 478 2597 ** Mob +353 86 807 9169
27 Páirc an Fhéithlinn; Baile an Bhóthair; Co. Átha Cliath; Éire
Re: Euro character in ISO
Robert A. Rosenberg wrote:

> At 15:30 -0800 on 07/11/00, Asmus Freytag wrote about Re: Euro character in ISO: There has been an attempt to create a series of 'touched up' 8859 standards. The problem with these is that you get all the issues of character set confusion that abound today with e.g. Windows CP 1252 mistaken for 8859-1, with a vengeance.
>
> The problem would go away if the ISO would get their heads out of their a$$ and drop the C1 junk from the NEW 'TOUCHED UP' 8859s and put the CP125x codes there.

Sorry. It may work for CP1252/iso-8859-1 and CP1254/iso-8859-9, but it won't for the others. Since Windows starts with the same letter as Word (or is the reason that they both come from the same company? No! I cannot believe that), there are a couple of requirements that make the "other" codepages slightly incompatible, such as the necessary presence of · at position B5 (because this is the character Word uses when you ask it to "display" the spaces, and this is hard-coded in the product).

> Then when you said you used 8859-21 you'd get CP-1252 and Windows would no longer need to lie (or tell the truth by admitting it is CP-1252).

Even if 8859-21 were defined to be exactly the same as some stage of CP1252, and everyone in the standardization community admitted it as such, habits are so entrenched, and love for Microsoft so rare in the Unix world, that you may bet a lot that such a standard would never gain wide acceptance. Furthermore, this is completely unnecessary, as such a standard already exists, and it is called 'charset=windows-1252'...

The real problems are that:
- Windows browsers/MAs did not know that until 1999 (as it seems);
- Windows HTML-tools/MAs are reluctant to add a test for the presence of non-Latin-1 characters, to tag as either iso-8859-1 or windows-1252. Apparently they are too lazy (even though they already do such a test for ASCII).

Well, I am angry, because probably nowadays browsers do the job correctly.

Antoine
Re: correction (was: Not all Arabics are created equal...)
On Wed, 12 Jul 2000, Gregg Reynolds wrote:

> But in any case, this doesn't change the main point: Persian may be spoken MSD-first, but its written forms are LSD-first.

No. Except when doing addition etc. (just like in English), Persian numbers are written MSD-first. When I (and any other Persian speaker I know) try to write something like "I have 12 books", which is "man 12 ketaab daaram" in Persian, I write it in this fashion (each line showing the text after one more keystroke):

M
AM
NAM
1 NAM
12 NAM
K 12 NAM
EK 12 NAM
...
MARAAD BAATEK 12 NAM

This means that Persian is also written MSD-first.

--roozbeh
RE: Euro symbol in HTML (was: Euro character in ISO)
Otto Stolz wrote:

> Hence, the only safe way to encode the Euro symbol seems to be:
> - Use the &euro; entity. This will cause Netscape 4.7 to display "EUR" if the Euro glyph is not available (at least the version on my Unix box does so).
> The following two ways are safe, if the Euro glyph is available in the fonts specified by the user:
> - use UTF-8 together with the decimal NCR "&#8364;";
> - use UTF-8 together with the UTF-8 encoding 'E2 82 AC' (in hex).
> In all cases, do not forget to declare your HTML source as either HTML 4.0 or HTML 4.01.

I can confirm that &euro; and &#8364; also work with Netscape 4.73 under Windows 95. However, the euro symbol seems to be the exception. In my index of HTML 4 named character entities at http://www.hclrss.demon.co.uk/demos/ent4_frame.html, Netscape 4.73 does not recognise any of the other named character entities that correspond to decimal numbers greater than 255. (With View | Character Set set to Unicode (UTF-8), and using Arial Unicode MS.)

Alan Wood (Documentation Writer / Web Master)
Context Limited (Electronic publishers of UK and EU legal and official documents)
mailto:[EMAIL PROTECTED]
http://www.context.co.uk/
http://www.alanwood.net/ (Unicode, special characters, pesticide names)
ATM light glyphs for Unicode characters?
Anyone know if Adobe's (free) ATM Light (http://www.adobe.com/products/atmlight/main.html) supports display of glyphs for Unicode characters when these are named according to Adobe's document "Unicode Glyph Names" (http://partners.adobe.com/asn/developer/typeforum/unicodegn.html)?

- Chris

--
ཿརྨ༼སྦྷྲུ༼རྦྷྱུཧ༼སྙར༼འཾིར།
Re: Euro character in ISO
Robert A. Rosenberg wrote:

> Then when you said you used 8859-21 you'd get CP-1252 and Windows would no longer need to lie (or tell the truth by admitting it is CP-1252).

And because certain companies had (and still have) bugs in their comms products, incorrectly identifying CP1252 data as ISO 8859-1, ISO standards should reject ISO 2022 and populate C1 with graphic characters? I suppose other inconsiderate incompatibilities, such as the incorrect encoding of half-pitch kana in ISO-2022-JP, are the fault of ISO too? Perhaps those companies that have these major bugs in their software, all of which have been repeatedly pointed out, should fix the problems there. The rest of the industry bends over backwards to accommodate these corrupt data, so a little effort on the part of the guilty would help a lot, and might prevent misguided postings like the above.

B=
Proposal to make the unicode list more transparent! (Sender:
Jens Siebert [EMAIL PROTECTED] wrote:

> However, because of the tremendous amount of mail, I would like to suggest splitting the list into various lists, divided by main topics. These could be sorted by groups of languages, such as CJK(+V) and other groups. Another sector could be technical issues, such as encoding-related mails, statements about program code, source samples, etc.!

I cannot speak for the list administrators, but I am on about four mailing lists, and almost every list gets a request like this from time to time. It seems at first glance to be a worthy goal. The problem is that topics and people naturally stray, and what starts as a discussion about one Unicode-related topic ends up being about a totally different one, or even something completely unrelated to Unicode. Recently a discussion about how Japanese furigana should be encoded in Unicode mutated into a discussion about the history of control codes. This is called "topic drift," and it is not necessarily bad, but it is usually difficult to control and would be much more so if there were separate lists for CJK issues, Arabic issues, font issues, technology (fonts/browsers/terminals), etc.

There is already a separate list called "unicore" where members discuss proposals for new characters and scripts and other nuts-and-bolts issues. (BTW, how can I join that list? Is it for Unicode members only?)

> I put this idea here, because personally I only read unicode-list mails related to CJK and technical issues. I believe many of you may face the same problem, and would like to receive only certain mails related to specialized topics.

The best solution is to scan the "Subject" line of messages and to use your "delete" button on messages you don't care about. I know this sounds flippant every time someone says it, but experience shows it is really the best way. We can help by changing the "Subject" line of a thread to reflect that the underlying topic has changed.

-Doug Ewell
Fullerton, California
Re: Persian developers (was Re: Detecting installed fonts in ...
> The source for the Windows codepages: http://www.microsoft.com/globaldev !

This one is up at http://www.microsoft.com/globaldev/reference/sbcs/1256.htm

michka

----- Original Message -----
From: "Roozbeh Pournader" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Cc: "Unicode List" [EMAIL PROTECTED]
Sent: Wednesday, July 12, 2000 6:53 AM
Subject: Re: Persian developers (was Re: Detecting installed fonts in ...

> On Wed, 12 Jul 2000, Michael (michka) Kaplan wrote: One thing they do is use the LATEST cp 1256, which includes the Farsi characters, so the hacks are not needed and then they would not have to move to Unicode, actually. I ran across several localizers who were willing to produce files in three formats:
>
> Would you please give me a link to the conversion table for the latest CP1256? The version I saw on the Unicode web site lacks:
> U+066B ARABIC DECIMAL SEPARATOR
> U+06A9 ARABIC LETTER KEHEH
> U+06C0 ARABIC LETTER HEH WITH YEH ABOVE
> U+06CC ARABIC LETTER FARSI YEH
> which are needed for Persian.
> --roozbeh
Re: Han character names?
At 4:27 AM -0800 7/12/00, Michael Everson wrote:

> At 10:23 -0800 2000-07-11, Jon Babcock wrote: But covering the jiaguwen [J. koukotsumoji] (oracle bone script) is another story. First of all, it's a moving target.
>
> Isn't it best treated as a font variant of CJK?

That's really an open question. We'd need to get a solid survey of the oracle bone characters and their modern counterparts. One problem is that a significant percentage of the former aren't identified (or even identifiable) with modern characters.

--
John H. Jenkins
[EMAIL PROTECTED]
http://www.blueneptune.com/~tseng
Re: C1 controls and terminals (was: Re: Euro character in ISO)
> Frank da Cruz [EMAIL PROTECTED] wrote: If you send a code in the 0x80-0x9F range to such a terminal or emulator, it properly treats it as a control code. If it was intended as a graphic character ("smart quote" or some such) the result is a fractured screen, sometimes even a frozen session. This is the widely reported compatibility problem between UTF-8 and terminals.
>
> I know I read somewhere, possibly on Markus Kuhn's Unicode page, possibly somewhere else, that ISO 2022 codes exist to switch out of "ISO 2022 mode" and into "UTF-8 mode" and to either allow or prevent switching back to 2022. Is there any progress on implementing this so terminals and emulators can live with UTF-8? Maybe Markus can clarify.

I would be surprised if there's anything in ISO 2022 about UTF-8, except that it does provide a way to switch out of and back into ISO 2022 mode, allowing the use of character sets that do not comply with ISO 2022 and 4873. That's what the designating escape sequences "with standard return" and "without standard return" are for. But that's not quite the same thing.

There is no good reason why UTF-8 couldn't be used by (say) a VT320 emulator without switching out of the ISO 2022 regime, except that UTF-8 contains C1 control codes as data. This was discussed here a while back, and "the other Markus" showed how a C1-safe form of UTF-8 could have been designed: http://www.mindspring.com/~markus.scherer/utf-8c1.html But, as they say, "it's too late now".

Therefore, those of us who want to make use of UTF-8 within the ISO 2022 regime must reverse the layers: first decode the UTF-8, then parse for escape sequences. Of course your emulator can get into awful trouble that way if the data stream isn't really UTF-8. But overall it's not that bad; we can live with it, and in fact have done it this way in practice in our own emulator.

- Frank
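Frank's observation that "UTF-8 contains C1 control codes as data" is easy to see in the raw bytes. A small sketch in Python (the `c1_bytes` helper is my illustration, not an existing tool):

```python
# An ISO 2022 terminal treats bytes 0x80-0x9F as C1 control functions,
# but many UTF-8 sequences contain such bytes as data. For example, the
# euro sign U+20AC encodes as E2 82 AC, and the middle byte 0x82 falls
# squarely in the C1 area.
def c1_bytes(s: str) -> list:
    """Return the UTF-8 bytes of s that fall in the C1 range 0x80-0x9F."""
    return [b for b in s.encode("utf-8") if 0x80 <= b <= 0x9F]

assert c1_bytes("\u20AC") == [0x82]   # euro: E2 82 AC -> 0x82 is C1 data
assert c1_bytes("caf\u00E9") == []    # 'café': C3 A9, no C1 bytes here
```

This is exactly why the layers must be reversed as Frank describes: decode the UTF-8 first, then look for control functions in the decoded characters.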
RE: Proposal to make the unicode list more transparent!
> And what about using "on-topic" prefixes? E.g. (CJK), (Indic), (Fonts), (BIDI), etc. This could be a big help for both manual and automatic filtering. The actual "dictionary" of prefixes does not need to be formally defined a priori: its maintenance could be partially or totally spontaneous (e.g., one uses a new prefix and, if it is informative, others will use it for the next messages on the same topic). _ Marco

This is, I think, a good idea. If we informally agreed to a syntax, like "use square brackets for the topic", then people could filter for things like "[CJK]". Actually, I suppose there's no reason to restrict it to one subject; a single message about CJK fonts might use "[CJK][fonts]", so really this could be almost a keyword list.

Also, I think it has been good practice in the past to change the subject when there is enough drift, BUT to keep the previous topic for at least the first changed subject line, to make the transition clear to those only scanning subjects. So an example subject line with all of the above would be:

Subject: [CJK][fonts] Where can I find a good Korean font? (was: Re: [Arabic][fonts] Where can I find Arabic fonts?)

Mike
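If the bracket-tag convention caught on, the filtering side would be straightforward to automate. A sketch of the tag-extraction rule in Python (`leading_tags` is a hypothetical helper, not an existing tool; it deliberately ignores tags inside a trailing "(was: ...)" reference):

```python
import re

def leading_tags(subject: str) -> list:
    """Extract [tag] keywords from the start of a Subject line,
    skipping any number of 'Re:' prefixes and ignoring tags that
    only appear later, e.g. in a '(was: ...)' back-reference."""
    m = re.match(r"^(?:Re:\s*)*((?:\[[^\]]+\])+)", subject, re.IGNORECASE)
    if not m:
        return []
    return re.findall(r"\[([^\]]+)\]", m.group(1))

subj = ("[CJK][fonts] Where can I find a good Korean font? "
        "(was: Re: [Arabic][fonts] Where can I find Arabic fonts?)")
assert leading_tags(subj) == ["CJK", "fonts"]
assert leading_tags("Re: [BIDI] Numeric ordering") == ["BIDI"]
assert leading_tags("No tags here") == []
```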
Re: Persian developers (was Re: Detecting installed fonts in
Of these, only U+06A9 exists in the Windows CP1256, as can be demonstrated by using the MultiByteToWideChar() API or by reading ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1256.TXT

Bob Hallissy

[As an interesting aside, the WideCharToMultiByte() API maps both U+06CC (FARSI YEH) and U+064A (YEH) to Windows character code 237 (0xED).]

From: [EMAIL PROTECTED] AT Internet on 12-07-2000 11:53
To: [EMAIL PROTECTED] AT Internet@Ccmail
cc: [EMAIL PROTECTED] AT Internet@Ccmail (bcc: Bob Hallissy/IntlAdmin/WCT)
Subject: Re: Persian developers (was Re: Detecting installed fonts in

> On Wed, 12 Jul 2000, Michael (michka) Kaplan wrote: One thing they do is use the LATEST cp 1256, which includes the Farsi characters, so the hacks are not needed and then they would not have to move to Unicode, actually. I ran across several localizers who were willing to produce files in three formats:
>
> Would you please give me a link to the conversion table for the latest CP1256? The version I saw on the Unicode web site lacks:
> U+066B ARABIC DECIMAL SEPARATOR
> U+06A9 ARABIC LETTER KEHEH
> U+06C0 ARABIC LETTER HEH WITH YEH ABOVE
> U+06CC ARABIC LETTER FARSI YEH
> which are needed for Persian.
> --roozbeh
Re: Persian developers (was Re: Detecting installed fonts in ...
On Wed, 12 Jul 2000, Michael (michka) Kaplan wrote:

> http://www.microsoft.com/globaldev/reference/sbcs/1256.htm

That only adds KEHEH. I still lack:

U+066B ARABIC DECIMAL SEPARATOR
U+06C0 ARABIC LETTER HEH WITH YEH ABOVE
U+06CC ARABIC LETTER FARSI YEH

--roozbeh
Re: Persian developers (was Re: Detecting installed fonts in ...
On Wed, 12 Jul 2000, Michael (michka) Kaplan wrote:

> I looked at two of the docs; it looks like they were using U+002C for the decimal separator even when they were using Unicode (I do not know how common that choice would be).

That's not good for typography. For Persian usage, U+002F (slash) is even better than that. The slash is usually misused for that purpose when the charset lacks the Persian decimal separator.

--roozbeh
Re: Persian developers (was Re: Detecting installed fonts in ...
From: "Roozbeh Pournader" [EMAIL PROTECTED]

> > I looked at two of the docs; it looks like they were using U+002C for the decimal separator even when they were using Unicode (I do not know how common that choice would be).
>
> That's not good for typography. For Persian usage, U+002F (slash) is even better than that. The slash is usually misused for that purpose when the charset lacks the Persian decimal separator.

I will forward that on to them (not knowing Farsi, I am only as good as the localizer behind it all in these cases!).

michka
RE: Proposal to make the unicode list more transparent!
> This is, I think, a good idea. If we informally agreed to a syntax, like "use square brackets for the topic", then people could filter for things like "[CJK]".

This might sound silly, but some people still use ISO 646-based displays, in which square brackets show up as umlauts, etc. Parentheses are safer. Also note that RFC 822 has included a Keywords: header for just this purpose ever since 1982.

Anyway, all attempts to tame mailing lists generally fail, so let's not waste too much time on this. After all, the relation of the Subject: line to the body is only one of our problems. Others include inappropriate (or non-) tagging of character sets, Silly-MIME-Enclosure Syndrome, Hideous-Formatting Syndrome, and Profligate-Quoting Syndrome. But at least I don't recall seeing any virus-bearing messages here yet...

- Frank :-]
Re: Persian developers (was Re: Detecting installed fonts in ...
The ones that they were having trouble with were U+0649 and U+064A. I looked at two of the docs; it looks like they were using U+002C for the decimal separator even when they were using Unicode (I do not know how common that choice would be).

michka

----- Original Message -----
From: "Roozbeh Pournader" [EMAIL PROTECTED]
To: "Michael (michka) Kaplan" [EMAIL PROTECTED]
Cc: "Unicode List" [EMAIL PROTECTED]
Sent: Wednesday, July 12, 2000 9:00 AM
Subject: Re: Persian developers (was Re: Detecting installed fonts in ...

> On Wed, 12 Jul 2000, Michael (michka) Kaplan wrote: http://www.microsoft.com/globaldev/reference/sbcs/1256.htm
>
> That only adds KEHEH. I still lack:
> U+066B ARABIC DECIMAL SEPARATOR
> U+06C0 ARABIC LETTER HEH WITH YEH ABOVE
> U+06CC ARABIC LETTER FARSI YEH
> --roozbeh
Re: Eudora?
On 7/12/00 at 8:42 AM -0800, Mark Davis wrote: By the way, does anyone know if Eudora lets you read and write email with UTF-8? The latest version of Mac Eudora lets you read UTF-8. If I can get my act together, the next version may let you write. I'm not sure what we'll be able to get into Windows for the next version. pr -- Pete Resnick mailto:[EMAIL PROTECTED] Eudora Engineering - QUALCOMM Incorporated
Re: Han character names?
At 12:56 PM 7/11/00, [EMAIL PROTECTED] wrote:

> If you bought a copy of the book, you would have known.
>
> I saw 2.0 in the Barnes & Noble book store the other evening, but they only had one left and it was a struggle to get to it through the competing crowd... Of course, they were competing to reach the latest Harry Potter... and I did flip through 2.0. It was mostly useless, a picture book with uninteresting pictures.

Thanks for the endorsement, John. But... 2.0 is pretty out of date. B&N is apparently more devoted to stocking the most recent Harry Potters than to stocking the most recent Unicode Standards. Wonder whether there's a message there.

> Now, if there were an on-line version that could be searched and had accompanying fonts close at hand instead of those aggravating PICTs/GIFs/JPEGs scattered about, then it'd be useful.

If you have access to Win2K, you might try the Unibook character browser at http://www.unicode.org/unibook - it also works with Win9x and NT 4.0. In either case, the trick is to make sure the large Asian fonts and Arial Unicode MS are installed. On systems prior to Win2K you can get them via the Office 2000 or IE5 language packs etc., as described in many earlier postings on this list.

A./
Re: Euro character in ISO
At 04:27 AM 07/12/2000 -0800, Michael Everson wrote:

> At 18:19 -0800 2000-07-11, Robert A. Rosenberg wrote: The problem would go away if the ISO would get their heads out of their a$$ and drop the C1 junk from the NEW 'TOUCHED UP' 8859s and put the CP125x codes there.
>
> Excuse me, but that is not appropriate. The ISO/IEC 8859 series is conformant with ISO/IEC 2022, and protocols which adhere to that standard should not be compromised by what you suggest.
>
> > Then when you said you used 8859-21 you'd get CP-1252 and Windows would no longer need to lie (or tell the truth by admitting it is CP-1252).
>
> The problem is that some companies do/did not correctly identify their code pages. The world can live with Latin-1 and CP-1252. It shouldn't have to live with CP-1252 being identified as Latin-1.

Which is what I am saying when I talk about admitting that you are using CP-1252, not ISO-8859-1 (in your MIME/HTML headers), at least in the case where there are glyphs from the 0x80-0x9F range in use. If a system can claim US-ASCII if no codes in the 0x80-0xFF range appear, and ISO-8859-1 otherwise (as many MUAs do), it should have the smarts to claim CP-1252 if in its scan it found a 0x80-0x9F glyph.
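The labelling rule Robert describes (US-ASCII when every byte is 7-bit, CP-1252 when a graphic byte falls in the C1 area, ISO-8859-1 otherwise) is mechanical. A minimal sketch in Python (`guess_charset_label` is my hypothetical helper, not code from the thread):

```python
def guess_charset_label(data: bytes) -> str:
    """Label 8-bit text per the rule discussed above: US-ASCII if all
    bytes are 7-bit; windows-1252 if any byte falls in 0x80-0x9F, the
    C1 area where CP1252 places graphic characters such as smart
    quotes and the euro sign; otherwise ISO-8859-1."""
    if all(b < 0x80 for b in data):
        return "US-ASCII"
    if any(0x80 <= b <= 0x9F for b in data):
        return "windows-1252"
    return "ISO-8859-1"

assert guess_charset_label(b"plain text") == "US-ASCII"
assert guess_charset_label(b"caf\xe9") == "ISO-8859-1"        # e-acute, 0xE9
assert guess_charset_label(b"\x93quote\x94") == "windows-1252"  # CP1252 curly quotes
```

Of course this only distinguishes the two charsets when a C1-area byte actually occurs; text using only the shared Latin-1 repertoire is labelled ISO-8859-1, which is harmless since the two agree there.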
Re: Euro character in ISO
At 08:56 PM 07/11/2000 -0800, Geoffrey Waigh wrote:

> On Tue, 11 Jul 2000, Robert A. Rosenberg wrote:
> > At 15:30 -0800 on 07/11/00, Asmus Freytag wrote about Re: Euro character in ISO: There has been an attempt to create a series of 'touched up' 8859 standards. The problem with these is that you get all the issues of character set confusion that abound today with e.g. Windows CP 1252 mistaken for 8859-1, with a vengeance.
> > The problem would go away if the ISO would get their heads out of their a$$ and drop the C1 junk from the NEW 'TOUCHED UP' 8859s and put the CP125x codes there.
>
> Except that would break all the systems that understand that C1 "junk," and a number of systems do so because they are adhering to other ISO standards. If you are going to force someone to change their datastreams to something new, they might as well go to some flavour of Unicode anyway.

Who is going to get broken if I say in my MIME header (or HTML) that my CHARSET is (for example) ISO-8859-21? You are talking about uses where the computer is talking to a device and needs the C1 range to tell it what to do, not another computer (where it is just passing a text stream). The C1 codes are DEVICE CONTROL and have no purpose (except to occupy slots that are better used for extra GLYPHS) in EMAIL or HTML transfer. I am NOT asking for anyone to change their mode of operation - only for ISO-8859-x codes that are designed for transfer of printable data. UNICODE is not a viable option, since all we are talking about is the ability to select from a number of 256-codepoint 8-bit tables, not go over to UTF-8 or UTF-16 (which would require changes to the program code).

> Geoffrey "tilting at terminal emulators, err windmills."
Re: Euro character in ISO
On Wed, 12 Jul 2000 10:43:59 -0800, Robert A. Rosenberg wrote: At 08:56 PM 07/11/2000 -0800, Geoffrey Waigh wrote: On Tue, 11 Jul 2000, Robert A. Rosenberg wrote: At 15:30 -0800 on 07/11/00, Asmus Freytag wrote: There has been an attempt to create a series of 'touched up' 8859 standards. The problem with these is that you get all the issues of character set confusion that abound today with e.g. Windows CP 1252 mistaken for 8859-1, with a vengeance. The problem would go away if the ISO would get their heads out of their a$$ and drop the C1 junk from the NEW 'TOUCHED UP' 8859s and put the CP125x codes there. Except that would break all the systems that understand that C1 "junk," and a number of systems do so because they are adhering to other ISO standards. If you are going to force someone to change their datastreams to something new, they might as well go to some flavour of Unicode anyway. Who is going to get broken if I say in my MIME header (or HTML) that my CHARSET is (for example) ISO-8859-21? We go through this exercise about twice a year. First, let's recognize that ISO is not about to revoke Standards 4873 and 2022, so there's not much point in suggesting it. Second, think of a terminal that complies with these standards - a physical terminal such as a VT320. I am using it to access my mail host in text mode, and I'm reading mail with (say) Unix 'mail'. The terminal does not interpret the MIME headers. It doesn't parse HTML. It implements a very straightforward finite state automaton that realizes the ISO 2022 terminal model. Unix 'mail' sends to my terminal the bytes of the message, period. Perhaps you're suggesting that Unix 'mail' should become a translation agent between the character set of the mail and that of the user's terminal?
I hope not, since, given that practically any character set anybody can dream up is "MIME-compliant" as long as it's tagged, every mail program would then have to know how to convert from every character set in existence to every other one. Or is it the mail transfer agent? Or both? It's really quite a mess; let's not go out of our way to make it worse. To understand the implications of using 8-bit character sets that contain graphic characters in the C1 area FOR INTERCHANGE, imagine trying to do the same thing to the C0 area. - Frank
Re: Euro character in ISO
On Wed, 12 Jul 2000, Frank da Cruz wrote: Perhaps you're suggesting that Unix 'mail' should become a translation agent between the character set of the mail and that of the user's terminal? I hope not, since, given that practically any character set anybody can dream up is "MIME-compliant" as long as it's tagged, every mail program would then have to know how to convert from every character set in existence to every other one. Yes, it damn well should. And this is easy, as there is a standard Unix function that knows how to do this (it's called iconv). I'm logged into Unix right now: $ iconv bash: iconv: command not found $ How standard can it be? And what about VMS, VM/CMS, VOS, OS/390, OS/400, Tandem, and all the others? How does the mail client know what character set my terminal has? Anyway, between you and me, there are potentially lots of places where character-set conversion can occur: your mail client, your MTA, my MTA, my mail client, my Telnet server, my Telnet client, my terminal emulator. Let's think carefully about this before we have random combinations of these clients, agents, and servers stepping on each others' toes. - Frank
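For readers wondering what the disputed conversion step amounts to: an iconv-style recode is just a decode followed by an encode. A minimal sketch in Python, standing in for the C iconv API (which, as the thread notes, was not universally available); `recode` is a hypothetical name:

```python
def recode(data: bytes, src: str, dst: str) -> bytes:
    # An iconv-like conversion is decode-then-encode. The hard part
    # mail software faces is unmappable characters, made visible
    # here by errors="replace".
    return data.decode(src).encode(dst, errors="replace")

# The CP-1252 Euro (byte 0x80) survives a trip to UTF-8,
# but becomes "?" in Latin-1, which has no Euro at all:
print(recode(b"\x80", "cp1252", "utf-8"))
print(recode(b"\x80", "cp1252", "latin-1"))
```

The substitution behavior is exactly where Frank's objection bites: someone has to decide what happens when the target set lacks a character.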
Re: Han character names?
Michael Everson wrote: Ar 10:23 -0800 2000-07-11, scríobh Jon Babcock: But covering the jiaguwen [J. koukotsumoji] (oracle bone script) is another story. First of all, it's a moving target. Isn't it best treated as a font variant of CJK? Partly so. But only about 30% of the jiaguwen are unifiable with known modern hanzi. -- Schlingt dreifach einen Kreis um dies! || John Cowan [EMAIL PROTECTED] Schliesst euer Aug vor heiliger Schau, || http://www.reutershealth.com Denn er genoss vom Honig-Tau, || http://www.ccil.org/~cowan Und trank die Milch vom Paradies.-- Coleridge (tr. Politzer)
Qur'an Arabic eBook port to PalmOS related MISC.
I'm trying to create a bilingual and bi-directional (Arabic and English) Qur'an e-Book that will be compliant with the Open eBook (OEB) specification. This is targeted at the PalmOS, but should be renderable in XML- and/or XHTML-compliant browsers such as IE 5.0 and Netscape 6.0, or any type of Open eBook reader. I already have the HTML files of the entire Qur'an in Arabic and English - though I will have them proofread many times before I distribute the completed eBook. The Arabic pages are coded using the win-1256 (Arabic) codepage in the following manner: HTML DIR=RTL head META content="text/html; charset=windows-1256" http-equiv=Content-Type body p align="right" font face = "Traditional Arabic" font size = "5pt" These pages show up fine (correct font and directionality) when using the IE 5.0 browser; however, when I convert them to the PalmOS, the right-to-left directionality is lost. To convert the HTML pages to the OEB eBook format I'm using MobiPocket Publisher (home page http://www.mobipocket.com/en/HomePage/default.asp), which creates a prc file from the HTML files. To test the conversion to the PalmOS, I'm using the PalmOS Emulator (running a 3.5 Palm OS IIIc ROM) with the APOS 2.0 (home page http://www.arabicpalm.com/) and MobiPocket Reader software installed. The above setup is being tested on Windows 98 (Arabic Enabled Edition) and Windows 2000 PCs. The prc files created using this method display the Arabic font on the emulator's Palm IIIc screen (when using the MobiPocket reader); however, the correct direction is not enforced. Please note that the Arabic and English text are coded in separate HTML files. My questions are as follows: How can I convert from cp 1256 to Unicode without doing it character by character? Is there software that will do this? Does the eBook spec allow for the nesting of a right-to-left language (Arabic) inside a left-to-right language (English) on the same page?
Does anyone know if APOS is Unicode compliant? Any advice or examples would be greatly appreciated, as I have not found any examples of how to nest languages (with different text and directionality) within the Palm doc or prc formats. Akil Get Your Private, Free E-mail from MSN Hotmail at http://www.hotmail.com
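On the first question (converting CP-1256 files to Unicode without going character by character): any codec-aware tool can re-encode a whole file in one pass. A minimal Python sketch, under the assumption that the files declare their charset in a META tag as shown above; the demo file and function name are illustrative only:

```python
from pathlib import Path
import tempfile

def cp1256_to_utf8(src: Path, dst: Path) -> None:
    # Decode the whole file from CP-1256 (Windows Arabic), fix the
    # in-file charset declaration so browsers decode the new bytes
    # correctly, and write it back out as UTF-8.
    text = src.read_text(encoding="cp1256")
    text = text.replace("charset=windows-1256", "charset=utf-8")
    dst.write_text(text, encoding="utf-8")

# Demo with a tiny stand-in page (0xC8 is ARABIC LETTER BEH in CP-1256):
tmp = Path(tempfile.mkdtemp())
(tmp / "page.html").write_bytes(b'<meta charset=windows-1256>\xc8')
cp1256_to_utf8(tmp / "page.html", tmp / "page_utf8.html")
print((tmp / "page_utf8.html").read_text(encoding="utf-8"))
```

Updating the declaration along with the bytes matters: re-encoded content labeled with the old charset reproduces exactly the mislabeling problem discussed elsewhere in this digest.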
Re: Euro character in ISO
There are lots of Unixes: http://www.columbia.edu/kermit/unix.html How many of them have an iconv function?

rangda 47: man iconv
man: no entry for iconv in the manual.
rangda 48: cat /etc/motd
Welcome to Darwin!
rangda 49: well, hmmm...
zsh: command not found: well,
rangda 50:
Re: Subset of Unicode to represent Japanese Kanji?
I am NOT a Unicode expert but I am a Japanese speaker. Here are my 2 cents: A Japanese document consists of: hiragana: fewer than 100 characters; katakana: fewer than 100 characters; kanji: basic kanji has 6,879 characters as defined in JIS X 0208-1990, extended kanji has 6,067 characters as defined in JIS X 0212-1990. The extended kanji are rarely used -- less than 1% of daily newspaper text. The Microsoft-developed Shift-JIS encoding supports hiragana, katakana, and basic kanji, but not extended kanji. Technically, a Japanese document can be written entirely in Roman characters, but this is not a true Japanese document. It is very difficult to read and it leads to ambiguity and misunderstanding. It was only used back in the Telex days, when people had no choice. Foster Feng Programmer/analyst MIS Department National Instruments Otto Stolz [EMAIL PROTECTED] on 2000/07/12 07:41:35 To: "Unicode List" [EMAIL PROTECTED] cc: [EMAIL PROTECTED] (bcc: Foster Feng/TYO/NIC) Subject: Re: Subset of Unicode to represent Japanese Kanji? The Japanese I must support is the Kanji form. [...] I cannot support Unicode in its entirety due to memory constraints. If I am not mistaken, Kanji is ideographic characters, which would take the lion's share of memory to implement. Probably, you have to support kana (hiragana or katakana). I do not know Japanese, so others may jump in. Best wishes, Otto Stolz
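A practical way to check whether text stays inside the subset Foster describes (kana plus the JIS X 0208 basic kanji) is to try encoding it as Shift-JIS, whose repertoire is essentially that subset. A sketch; `fits_shift_jis` is a hypothetical helper name:

```python
def fits_shift_jis(text: str) -> bool:
    # Shift-JIS covers kana and the JIS X 0208 basic kanji, so a
    # successful round-trip through the codec means the text fits
    # the "basic" Japanese subset of Unicode discussed above.
    try:
        text.encode("shift_jis")
        return True
    except UnicodeEncodeError:
        return False

print(fits_shift_jis("\u3042\u6f22\u5b57"))  # hiragana A + "kanji": True
print(fits_shift_jis("\u20ac"))              # Euro sign: False
```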
Miscellaneous comments/questions.
Hi! I just returned from a lengthy trip through parts of Europe and thought I'd mention some observations. In Greece, I noticed that almost all signs used monotonic Greek. I saw some older road signs and a couple of store signs that used polytonic Greek, but according to a Greek acquaintance, everybody is very happy to not have to deal with it anymore. When did the switch actually happen? He claimed it was only about a decade ago? What was interesting to see was how the printing of the tonos varied. For the most part it did look like a steeper acute as described in Chapter 7.2 of Unicode 3. A number of times, though, I did see a variation which looked more like, e.g., U+03B1 U+0307, but I suspect that to be just a font style. I also noticed that frequently, certain characters are written in variants which at first were completely indecipherable to me. I especially recall the beta (U+03D0), theta (U+03D1), and maybe pi (U+03D6), as well as the upper-case upsilon (U+03D2). For someone who learned classical Greek in school, it added to the problems I already had with the modern pronunciation of a lot of the letters ;-) One thing I found very confusing was the mixing of Latin and Greek script, which is very common on billboards. A couple of times I found myself unable to tell whether a word was spelled in Latin or Greek since it only used glyphs which both scripts share, and hence I could not derive the proper pronunciation at first. It was interesting to see some brand name products and proper names transcribed, while sometimes Latin script is used in mid-sentence for foreign words. A similar issue was very interesting to observe in France and Germany. The use of the English language in advertising seems to run rampant in Germany, while almost all ads that include English in France (mostly tag lines) are followed by an asterisk and the literal French translation somewhere near the edge of the sign.
At first I thought it was somewhat silly, but when I saw how the German language is currently absorbing English words like a sponge, the footnotes seemed to make sense. While in Germany, I bought a children's book that was first published in 1921 and used a simplified Fraktur. As a native German, I had no problems reading it, but for my wife, who doesn't have German as her native language, the long s did throw her off at first. After I explained the logic behind it, it was a lot easier, but she did make a good point as to why it isn't used in the "sp" digraph. Maybe Otto can shed some light on this? In looking at older Fraktur text, it was very interesting to see how foreign words are set in an Antiqua font, similar to how in English text foreign words are often in italics (and similar to the use of Latin script in Greek above.) This brings up a font question I have been wondering about for a long time: How similar are typesetting features of fonts across different scripts? It seems that most European scripts have print and cursive versions (I saw some beautiful cursive signs in Greece), serifs and mono-spaced fonts, and boldness and slant seem to be common as well. But what about other scripts? It seems that all(?) scripts currently represented in Unicode have at least some typographical tradition, albeit only a scholarly one in some cases. How many of these features carry over, i.e., how much sense does it make to define a serif font for CJK scripts? What about italics in Arabic? Can there be a font family which covers all the scripts in Unicode and which complies with the local typographic esthetics? I apologize for the glyph-centric nature of the question ;-) Two other topics of discussion that came up in recent weeks were very interesting to me: Time zones and location names. The latter was something I have been curious about myself for a while. It is true that in Germany, for example, the state (Bundesland) is rarely indicated when referring to a location.
When ambiguity arises, regional names or other landmarks are used to distinguish, sometimes to the point of becoming part of the name. Examples: Hamm (Westfalen), and Frankfurt am Main versus Frankfurt an der Oder. Even more interesting to me, though, would be the local name of places, and I would love to find a World Atlas that first indicates every location's name in the local language and script, then the accepted Latin transliteration, and finally the name in English (or, say, German, if published in Germany.) Are the large publishing houses equipped to produce something like this? Or more importantly, would they use Unicode for it? What about smaller printers (like those for business cards)? The other issue that was brought up about time zones is fascinating. A while ago, when I was looking into locale issues, it occurred to me that there really needs to be a comprehensive database of "cultural defaults." For extensive localization, you need to know more than just date format, language, and script (OK, I am oversimplifying the extent of the locale information.)
RE: Euro character in ISO
The trick is HTML4. Since you sent the message in HTML format, the Euro is encoded as numeric character reference. Exchange knows how to decode HTML and generate RTF, depending on what your email client needs. If you had sent plain text, the Euro would have turned into ?. As is the case in the plain text part of the multipart message. This is the case for Outlook Express 5. Older versions of OE treated Windows-1252 and iso-8859-1 the same. Here is the source of the message from my Outlook Express Sent Mail folder. (To see the source, open message and press Ctrl-F3). From: "Chris Wendt" [EMAIL PROTECTED] To: "Chris Wendt" [EMAIL PROTECTED] Subject: Euro test Date: Wed, 12 Jul 2000 15:17:49 -0700 MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="=_NextPart_000_0005_01BFEC14.57202A10" X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook Express 5.50.4133.2400 X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4133.2400 This is a multi-part message in MIME format. --=_NextPart_000_0005_01BFEC14.57202A10 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable abcdef ? abcdef --=_NextPart_000_0005_01BFEC14.57202A10 Content-Type: text/html; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable !DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" HTMLHEAD META content=3D"text/html; charset=3Diso-8859-1" = http-equiv=3DContent-Type META content=3D"MSHTML 5.00.3103.1000" name=3DGENERATOR STYLE/STYLE /HEAD BODY bgColor=3D#ff DIVFONT color=3D#008000 face=3DVerdana size=3D2abcdef #8364;=20 abcdef/FONT/DIV/BODY/HTML --=_NextPart_000_0005_01BFEC14.57202A10-- -Original Message- From: Leon Spencer [mailto:[EMAIL PROTECTED]] Sent: Wednesday, July 12, 2000 2:38 PM To: Unicode List Subject: RE: Euro character in ISO Is Microsoft playing tricks in MS Outlook or IE? 
If I send text from Outlook Express to my exchange account, with charset set to iso-8859-1 but containing the Trademark symbol ((tm)) in the body, it shows up okay. The body of the message is in text/html. Is it possible that MS Outlook's HTML ActiveX control (which I'm assuming to be the same used for IE) is defaulting to Cp1252/Windows-1252 when it sees iso-8859-1? Leon BTW, The body also contains the Euro!
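The difference described here is easy to reproduce: an HTML body labeled iso-8859-1 can smuggle the Euro through as the numeric character reference &#8364;, while plain text in that charset has no Euro at all and degrades to "?". A sketch using Python's encoding error handlers to mimic the two behaviors:

```python
# The Euro (U+20AC) does not exist in ISO 8859-1. An HTML generator
# can escape it as a numeric character reference; a plain-text part
# can only substitute a replacement character.
euro = "\u20ac"
print(euro.encode("iso-8859-1", errors="xmlcharrefreplace"))  # HTML part
print(euro.encode("iso-8859-1", errors="replace"))            # text part
```

This is exactly the asymmetry visible in the quoted message source: the HTML alternative carries the reference, the text/plain alternative carries "?".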
Re: Subject lines in UTF-8 mssgs? [was: Proposal to make ...]
"Jaap Pranger" [EMAIL PROTECTED] wrote: At 16:44 +0200 2000.07.12, [EMAIL PROTECTED] wrote: Everybody (beginning with myself!) should probably be more careful in naming subject lines, and renaming them when a reply deviates from the subject. Marco, This will not help very much when you send UTF-8 messages. Your Subject lines in those messages show up completely "garbled", at least in my non-UTF-8-aware email client. OK, that's my problem. But mostly other people's UTF-8 messages show 'neat' Subject headers. What's going on, why this difference? Jaap In Outlook Express under Tools, Options, Send, International Settings it is possible to specify that only English (? ASCII) is used in headers, and under Tools, Options, Send, Plain Text Settings and Tools, Options, Send, HTML Settings it is possible to specify whether or not 8-bit characters may be used in message headers. These settings seem to apply whatever encoding is used for the body of the message. - Chris
Re: Eudora?
By the way, does anyone know if Eudora lets you read and write email with UTF-8? The latest version of Mac Eudora lets you read UTF-8. If I can get my act together, the next version may let you write. I'm not sure what we'll be able to get into Windows for the next version. Is it the default encoding? What about other IANA encodings? Is it able to produce structured text/html or text/xml parts in multipart/alternative messages, or alone?
RE: Subject lines in UTF-8 mssgs? [was: Proposal to make ...]
From: Christopher J. Fynn [mailto:[EMAIL PROTECTED]] In Outlook Express under Tools, Options, Send, International Settings it is possible to specify that only English (? ASCII) is used in headers This is relevant when you are running with a non-English OS locale. It prevents non-US-ASCII characters from being entered for day and month names in the reply header, so as not to force sending in UTF-8 when you write in a different script from that of the OS locale. and under Tools, Options, Send, Plain Text Settings and Tools, Options, Send, HTML Settings it is possible to specify whether or not 8-bit characters may be used in message headers. This does not prevent non-US-ASCII characters in the header. It only decides whether the non-US-ASCII characters will be RFC 1522 encoded or sent as raw 8-bit bytes - each in the chosen encoding. These settings seem to apply whatever encoding is used for the body of the message. Yes, correct.
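For reference, the RFC 1522 encoding mentioned here (since superseded by RFC 2047) turns a non-US-ASCII Subject into an "encoded word". A sketch using Python's standard email machinery as a stand-in for what the mail client does; the sample subject is arbitrary:

```python
from email.header import Header, decode_header

# Encode a non-US-ASCII Subject as an RFC 2047 encoded word...
encoded = Header("Pröposal", charset="utf-8").encode()
print(encoded)  # a =?utf-8?...?= encoded-word form

# ...and decode it back, as a receiving client would.
raw, charset = decode_header(encoded)[0]
print(raw.decode(charset))
```

A client that cannot decode these encoded words displays exactly the "garbled" `=?utf-8?...?=` Subject lines Jaap describes in the previous message.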
RE: correction (was: Not all Arabics are created equal...)
Again: the writing protocol (or algorithm) does not matter. Look at the many ways I can write the number four thousand two hundred fifty-seven (using "_" for digit positions not yet filled in): The conventional way: 4___ 42__ 425_ 4257. "Backwards": ___7 __57 _257 4257. "Evens first, forwards": _2__ _2_7 42_7 4257. "Odds first, backwards": __5_ 4_5_ 425_ 4257. "Evens first, forwards, then odds, backwards": _2__ _2_7 _257 4257. Etc., etc., etc. We can run through the same exercise in any language. The outcome is always the same. What counts (no pun intended) is the mathematical rule of evaluation, which says that the LSD position is ones, the next over is tens, then hundreds, etc. In English and most European languages, the MSD, as defined by the mathematical rule of evaluation, comes first in reading order, and "first in reading order" in English means to the left of the other figures. In Arabic, and Persian, and Urdu, etc., "first in reading order" means to the right of the remaining figures, and that means the LSD. "Reading order" means typographically, on the page, and not verbally; don't forget that the figures on the page denote numbers, not words, so pronouncing the words that represent the same number should not be construed as a reading of the figures, but of their meaning. So although Persian written forms are LSD first, the spoken translation is MSD first. The key point is that a mathematical modeling of written language (which is what Unicode amounts to) should model the semantics of written forms, and not the protocols/algorithms of putting ink on paper or emitting sounds into the air. I suspect the audience has become thoroughly bored by now, so if you'd like to continue the conversation maybe we should do so privately. Sincerely, Gregg -Original Message- From: Roozbeh Pournader [mailto:[EMAIL PROTECTED]] Sent: Wednesday, July 12, 2000 7:58 AM To: Unicode List Cc: Unicode List Subject: Re: correction (was: Not all Arabics are created equal...)
On Wed, 12 Jul 2000, Gregg Reynolds wrote: But in any case, this doesn't change the main point: Persian may be spoken MSD-first, but its written forms are LSD-first. No. Except when adding etc. (just like in English), Persian numbers are written MSD-first. When I (and any other Persian speaker I know) try to write something like "I have 12 books", which is "man 12 ketaab daaram" in Persian, I write it in this fashion: M AM NAM 1 NAM 12 NAM K 12 NAM EK 12 NAM ... MARAAD BAATEK 12 NAM This means that Persian is also written MSD-first. --roozbeh
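Unicode's own resolution of this debate is worth noting: it stores digits in logical order, MSD first, and gives them dedicated bidirectional classes so that the bidi algorithm renders a number with the MSD leftmost even inside right-to-left text, whatever order a writer actually pens the figures in. The classes can be inspected with the standard unicodedata module:

```python
import unicodedata

# Digits carry their own bidi classes, distinct from the letters
# around them: EN (European Number) and AN (Arabic Number) runs stay
# MSD-leftmost even when embedded in AL (Arabic Letter) text.
print(unicodedata.bidirectional("1"))       # European digit one
print(unicodedata.bidirectional("\u0661"))  # Arabic-Indic digit one
print(unicodedata.bidirectional("\u0627"))  # Arabic letter alef
```

So the encoding models the semantics of the written form, as Gregg argues it should, and leaves display ordering to the rendering layer.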
BTW, Anyone working with MS JVM AND Unicode?
BTW, anyone working with the MS JVM AND Unicode? I'd like to override the core ByteToChar Unicode classes used by the MS JVM. Currently, I'm modifying the TrustedClasspath so my modified sun.io package can be loaded first. Is there some way to get rid of the MS JVM's ByteToChar classes altogether? Leon