Re: Qamats Qatan (was Response to Everson Phoenician and why June 7?)
Jony Rosenne wrote:

> > *Except by Jony, who is always encouraging us to use markup to make
> > distinctions.
>
> I don't recall saying anything like this in this Phoenician discussion.

Acknowledged. My point was not about that discussion in particular, but about the generic question of to what degree plain-text is a requirement, regardless of what one wants to do within it. Your frequent refrain that distinctions of shape, for what you consider to be the same character (and note that I am not agreeing or disagreeing with any particular judgement), should be handled in 'mark-up' presupposes something other than plain-text in terms of displaying that distinction.

You frequently remind us that there are distinctions that are useful to some people, desirable in some circumstances, but which do not constitute a *requirement* in plain-text. Fair enough. For this same reason, I don't automatically accept the argument, made by Michael earlier today, that 'There is a requirement for distinction for X in plain-text'. On what basis do we decide that X is necessary in plain-text while Y should be done with mark-up or some other 'higher level protocol'?

John Hudson
RE: Qamats Qatan (was Response to Everson Phoenician and why June 7?)
> -Original Message- > From: [EMAIL PROTECTED] > [mailto:[EMAIL PROTECTED] On Behalf Of John Hudson > Sent: Thursday, May 20, 2004 1:08 AM > To: Michael Everson > Cc: [EMAIL PROTECTED] > Subject: Re: Response to Everson Phoenician and why June 7? > > ... > > In discussions of whether to encode individual > characters/glyphs -- and now, it seems, > scripts/styles --, much seems to be made of whether there is > a requirement to make a > distinction in plain-text, while the question of whether > there is a requirement to use > plain-text in the first place gets asked less often.* > > *Except by Jony, who is always encouraging us to use markup > to make distinctions. > I don't recall saying anything like this in this Phoenician discussion. I only say so when I believe it's true and relevant to Hebrew. It's all very nice to desire different shapes for different usages of the same character, but one must also think about the multitude who do not care or know or desire the distinction. > > John Hudson > > >
Re: Response to Everson Phoenician and why June 7?
Ernest Cline wrote:

> I would be very surprised if there were such a cybercafe. One that had both a Hebrew-Phoenician and a Hebrew-Hebrew font, with the Hebrew-Phoenician as the default, would be much easier to believe as a possibility. Still, it is a valid point. I think that if Phoenician were to be unified with Hebrew, it would probably behoove Unicode to establish variation sequences for Phoenician. Even with a separate Phoenician script, it might be a good idea to provide variation sequences that could be used to identify different script styles such as Paleo-Hebrew and Punic in the plain text.

This is not a practical use of variation sequences if, by this, you mean the use of variation selectors. What are you going to do, add a variation selector after every single base character in the text? Are you expecting fonts to support the tiny stylistic variations between Phoenician, Moabite, Palaeo-Hebrew, etc. -- variations that are not even cleanly defined by language usage -- with such sequences? Some people seem keen on variation selectors in the same way that others are keen on the PUA: as a catch-all solution to non-existent problems.

John Hudson
Re: Response to Everson Phoenician and why June 7?
> [Original Message]
> From: John Jenkins <[EMAIL PROTECTED]>
>
> On May 19, 2004, at 5:07 PM, John Hudson wrote:
>
> > Michael, can you briefly outline the points regarding this
> > 'requirement'? The only one that has been repeatedly referred to in
> > this too-long discussion is the Tetragrammaton usage; I'm not sure
> > whether that constitutes a requirement for plain-text or not. What are
> > the other points?
>
> You go down to your local cybercafe to read your email from your
> grandmother telling you all about your nephew's bar-mitzvah.
> Unfortunately, your local cybercafe has no modern Hebrew (or Yiddish)
> font installed, but they *do* have a Phoenician one. You cannot, as a
> result, even tell what language your grandmother is writing you in, let
> alone what it means.

I would be very surprised if there were such a cybercafe. One that had both a Hebrew-Phoenician and a Hebrew-Hebrew font, with the Hebrew-Phoenician as the default, would be much easier to believe as a possibility. Still, it is a valid point. I think that if Phoenician were to be unified with Hebrew, it would probably behoove Unicode to establish variation sequences for Phoenician. Even with a separate Phoenician script, it might be a good idea to provide variation sequences that could be used to identify different script styles such as Paleo-Hebrew and Punic in the plain text.
Re: ISO 15924 draft fixes
At 03:28 +0200 2004-05-20, Philippe Verdy wrote:

> It was in the previous list (see the online HTML table 2).

What does that refer to?

> Who decides on the addition of scripts to ISO 15924?

The ISO 15924 RA-JAC.

> I thought there was a separate technical committee and that you were just the bookkeeper of the decisions made by this subcommittee.

With regard to Coptic, and the need to sort out the initial difficulties we are having, it seems prudent that I do what is necessary to correct faults. It is unlikely that the RA-JAC will object to this.

> It can't be Unicode's UTC alone, as there are already codes for bibliographic references that are not (and never will be) encoded separately in Unicode, so I suppose that there are librarian or publisher members with whom you have to discuss this, independently of the work of Unicode, which should only be the registrar for these codes. Maybe there's still no formal procedure, and for now the codes are maintainable without a lot of administration.

Read the standard.

> Do you want a script that generates HTML tables from the reference text file?

No. We will handle that in due course.

> One final note: there's still a missing closing parenthesis in a French name, << latin (variante brisée >>, for the Fraktur script.

I think that has been corrected by now.
-- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: ISO 15924 draft fixes
From: "Michael Everson" <[EMAIL PROTECTED]> > >- Where is this line?: > > Syloti Nagri;Sylo;316;sylotî nâgrî;;2004-09-01 > > A new script? Oh, it's in the old file and not in > the new one? It, Coptic, and Phags-pa need to be > in the list (they are all under ballot). It was in the previous list (see the online HTML table 2). Who decides for the addition of scripts in ISO-15924? I thought there was a separate technical commity and that you were just the bookkeeper of the decisions made by this sub-commitee. It can't be Unicode's UTC alone, as there are already codes for bibliographic references that are not (and will never) be encoded separately in Unicode,so I suppose that there are librarian or publishers members with which you have to discuss, independantly of the work of Unicode, which should only be the registrar for these codes. May be there's still no formal procedure, and for now the codes are maintainable without lots of administration. Do you want a script that generate HTML tables from the reference text file? I'm not an expert in Perl, but my knowledge of PHP or "awk" is enough to create it. Or may be a simple Javascript could generate the presentation in browsers. I suggest you use a spreadsheet for now to allow sorting or moving columns. One final note: there's still a missing closing parenthese in a French name << latin (variante brisée >> for the Fraktur script.
Re: problems in Public Review 33
From: Philippe Verdy

> Are these permanently assigned non-characters
> encodable in any UTF or in CESU-8?

I would say they are. While they are not available for transmission of data, they are perfectly legal for internal use. Indeed, such internal use is the raison d'être of the block of noncharacters at FDD0..FDEF. An implementation may wish to either allow or disallow the transformation of noncharacters, depending upon how it uses those code points.
Re: problems in Public Review 33 UTF Conversion Code Update
/|/|ike (or |\|\ike) responded to Philippe:

> > However, I feel it's not legal (or really not recommended) to encode the
> > noncharacter code points xFFFE-xFFFF, where x is any plane number. So the
> > rules need to be a bit more detailed to exclude them.
>
> Why do we need special rules to not encode characters that don't
> exist?

Please, everybody, before we start another pointless thread, examine the actual definition of UTF-8 and the rationale for an encoding scheme. UTF-8 must be able to represent every Unicode scalar value -- and that *includes* noncharacter code points.

D28 Unicode scalar value: Any Unicode code point except high-surrogate and low-surrogate code points.

D29 A Unicode encoding form assigns each Unicode scalar value to a unique code unit sequence.

Before you all start shooting from the hip about UTF-8 on the list, please read (and understand) the normative definitions of these things in the standard.

--Ken

P.S. Whoever (and whatever) is starting to prepend "[BULK]" to thread topics, would you cease and desist? ;-)
Re: Response to Everson Phoenician and why June 7?
On May 19, 2004, at 5:07 PM, John Hudson wrote:

> Michael, can you briefly outline the points regarding this 'requirement'? The only one that has been repeatedly referred to in this too-long discussion is the Tetragrammaton usage; I'm not sure whether that constitutes a requirement for plain-text or not. What are the other points?

You go down to your local cybercafe to read your email from your grandmother telling you all about your nephew's bar-mitzvah. Unfortunately, your local cybercafe has no modern Hebrew (or Yiddish) font installed, but they *do* have a Phoenician one. You cannot, as a result, even tell what language your grandmother is writing you in, let alone what it means.

Of course, this criterion is difficult to apply to two varieties of writing separated by thousands of years -- and it might behoove the UTC to discuss the problems involved -- but if we accept minimum legibility as a factor in deciding when to unify/separate, I think it's a valid one.

John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jhjenkins/
RE: [BULK] - Re: problems in Public Review 33 UTF Conversion Code Update
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Philippe Verdy
Sent: Wednesday, May 19, 2004 4:21 PM

> However, I feel it's not legal (or really not recommended) to encode the
> noncharacter code points xFFFE-xFFFF, where x is any plane number. So the
> rules need to be a bit more detailed to exclude them.

Why do we need special rules to not encode characters that don't exist?

/|/|ike
Is there a better term than metascript for what I am thinking of?
It's not an actual attested English word, but the term "metascript" comes reasonably close to a concept I would like to express in a proposal I am preparing. A "metascript", as I am defining it, is a script, such as Latin, Cyrillic, or Arabic, that has been extended from a common core in a wide variety of ways to serve the needs of a wide variety of languages. One resulting aspect of metascripts is that they contain far more characters than are needed for any single use of the script. I find the concept useful in explaining why I have made certain decisions in my proposal, but I would prefer to use a standard term for the concept if there is one.
Re: ISO 15924 draft fixes
At 01:26 +0200 2004-05-20, Philippe Verdy wrote:

> I note also that the list of changes (the HTML file in your archive) does not include the change of orthography in English names for consonants with dots below (such as Malayalam). As this ISO 15924 standard should make the English and French names unambiguous, their orthography is important.

I understand that there are many problems with the online files; I made a comparison only with the plain-text files, and Malayalam was not spelled differently in that file, so I judged it irrelevant to the task of correcting the basic database.
-- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: ISO 15924 draft fixes
At 01:08 +0200 2004-05-20, Philippe Verdy wrote:

> I see some differences.
>
> - For Georgian, your new file contains only:
>   Georgian (Mkhedruli);Geor;240;géorgien (mkhédrouli);Georgian;2004-05-18
>   But the previous version also contained, in one of the online tables:
>   Georgian (Asomtavruli);Geoa;242;géorgien (assomtavrouli);Georgian;2004-01-05

That's correct. Asomtavruli has been deleted for now.

> - Where is this line?:
>   Syloti Nagri;Sylo;316;sylotî nâgrî;;2004-09-01
>   A new script?

Oh, it's in the old file and not in the new one? It, Coptic, and Phags-pa need to be in the list (they are all under ballot).

> Limbu has been adjusted to a more appropriate numeric code within South-Asian scripts (401 to 336).

Error corrected.

> I also think that the removal of duplicate rows for English or French name aliases was a good decision (after all, the aliases are already listed between parentheses).

No, it would allow a huge number of aliases. People can search the online files with Command-F or Control-F.

> I also think that splitting the line for the start and end codes of private scripts was a good idea.

It wasn't mine. I forget whose it was, but it makes the tables print more nicely.
-- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: ISO 15924 draft fixes
I note also that the list of changes (the HTML file in your archive) does not include the change of orthography in English names for consonants with dots below (such as Malayalam). As this ISO 15924 standard should make the English and French names unambiguous, their orthography is important.

- Original Message -
From: "Michael Everson" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, May 19, 2004 10:40 PM
Subject: ISO 15924 draft fixes

> The Registrar wishes to thank everyone who has taken an interest in
> the ISO 15924 data pages, and regrets the imperfections which are
> contained there. I am not sure how we will manage the generation of
> the pages, but it is clear that the base should be the plain-text
> document.
>
> I have made changes to the plain-text document and placed it, a draft
> Changes page, and the original plain-text document available at
> http://www.unicode.org/iso15924/iso15924-fixes.zip
Re: problems in Public Review 33 UTF Conversion Code Update
Frank Yung-Fong Tang wrote:

> It should be:
>
> Legal UTF-8 sequences are:
>
>   1st    2nd    3rd    4th     Codepoints
>   -----------------------------------------
>   00-7F                        0000-007F
>   C2-DF  80-BF                 0080-07FF
>   E0     A0-BF  80-BF          0800-0FFF
>   E1-EC  80-BF  80-BF          1000-CFFF
>   ED     80-9F  80-BF          D000-D7FF
>   EE-EF  80-BF  80-BF          E000-FFFF
>   F0     90-BF  80-BF  80-BF   10000-3FFFF
>   F1-F3  80-BF  80-BF  80-BF   40000-FFFFF
>   F4     80-8F  80-BF  80-BF   100000-10FFFF

However, I feel it's not legal (or really not recommended) to encode the noncharacter code points xFFFE-xFFFF, where x is any plane number. So the rules need to be a bit more detailed to exclude them. Are these permanently assigned noncharacters encodable in any UTF or in CESU-8?
Re: Response to Everson Phoenician and why June 7?
Michael Everson wrote:

> > There are already encodings suitable for all varieties of Northwest Semitic scripts. One can legitimately argue, as some have, that there are still some problems with the Hebrew and Syriac encodings, but not that we need anything more for the other NW Semitic languages other than some nice FONTS!
>
> Which would not address the plain-text requirement to distinguish the scripts qua scripts.

Michael, can you briefly outline the points regarding this 'requirement'? The only one that has been repeatedly referred to in this too-long discussion is the Tetragrammaton usage; I'm not sure whether that constitutes a requirement for plain-text or not. What are the other points?

In discussions of whether to encode individual characters/glyphs -- and now, it seems, scripts/styles --, much seems to be made of whether there is a requirement to make a distinction in plain-text, while the question of whether there is a requirement to use plain-text in the first place gets asked less often.*

*Except by Jony, who is always encouraging us to use markup to make distinctions.

John Hudson
Re: ISO 15924 draft fixes
I see some differences.

- For Georgian, your new file contains only:
  Georgian (Mkhedruli);Geor;240;géorgien (mkhédrouli);Georgian;2004-05-18
  But the previous version also contained, in one of the online tables:
  Georgian (Asomtavruli);Geoa;242;géorgien (assomtavrouli);Georgian;2004-01-05

- Where is this line?:
  Syloti Nagri;Sylo;316;sylotî nâgrî;;2004-09-01

Limbu has been adjusted to a more appropriate numeric code within South-Asian scripts (401 to 336).

I also think that the removal of duplicate rows for English or French name aliases was a good decision (after all, the aliases are already listed between parentheses). I also think that splitting the line for the start and end codes of private scripts was a good idea.

- Original Message -
From: "Michael Everson" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, May 19, 2004 10:40 PM
Subject: ISO 15924 draft fixes

> The Registrar wishes to thank everyone who has taken an interest in
> the ISO 15924 data pages, and regrets the imperfections which are
> contained there. I am not sure how we will manage the generation of
> the pages, but it is clear that the base should be the plain-text
> document.
>
> I have made changes to the plain-text document and placed it, a draft
> Changes page, and the original plain-text document available at
> http://www.unicode.org/iso15924/iso15924-fixes.zip
>
> I would appreciate it if interested persons could look this over and
> inform me if they find any further discrepancies between the two
> which are worth troubling about. Then we will proceed to generate the
> other files.
>
> I deleted some duplicate lines: Ethiopic was on two lines, under
> Ethiopic and under Ge'ez. It seemed inappropriate to burden the
> tables with such duplication.
>
> I added Coptic unilaterally.
> --
> Michael Everson * * Everson Typography * * http://www.evertype.com
RE: [BULK] - Re: Response to Everson Phoenician and why June 7?
> Yer ol' pal,
> Youtie

The real question here is "what took you so long"?

/|/|ike
Response to Everson Phoenician and why June 7?
Elaine Keown
Tucson

Hi,

I include below the response of Prof. Stephen A. Kaufman, one of the world's most famous Aramaists, to the Everson Phoenician proposal. Dr. Stephen A. Kaufman wrote (on the ANE list recently):

> Anyone who thinks there has to be a separate
> encoding for Phoenician either does not understand
> Unicode or (and probably "and") does not understand
> what a glyph is. There are already encodings
> suitable for all varieties of Northwest Semitic
> scripts. One can legitimately argue, as some have,
> that there are still some problems with the Hebrew
> and Syriac encodings, but not that we need anything
> more for the other NW Semitic languages other than
> some nice FONTS!
>
> Steve Kaufman

Why did Debbie suggest June 7 as the latest date for responses?

Elaine

__
Do you Yahoo!? SBC Yahoo! - Internet access at a great low price. http://promo.yahoo.com/sbc/
Re: Response to Everson Phoenician and why June 7?
I would respectfully suggest that Dr. Stephen A. Kaufman will need to come up with a more convincing or (and probably "and") professional argument than this one if he wants it to be taken seriously by people who have a very good understanding of both Unicode and glyphs, and who further have a serious set of requirements that suggest that Dr. Kaufman's needs may be the same as the needs of others who would like the script to be encoded. I doubt neither Dr. Kaufman's expertise nor his reputation, but it is clear that the actual stated requirements have not been discussed, nor has any specific problem inherent in the encoding been stated by him. He should consider that if on one side sit convincing arguments and on the other side sits his brief posting, then it is unlikely that his words will sway the committee.

MichKa [MS]
NLS Collation/Locale/Keyboard Development
Globalization Infrastructure and Font Technologies

- Original Message -
From: "E. Keown" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>; "Deborah W. Anderson" <[EMAIL PROTECTED]>
Cc: "John Cowan" <[EMAIL PROTECTED]>
Sent: Wednesday, May 19, 2004 1:54 PM
Subject: Response to Everson Phoenician and why June 7?

> Elaine Keown
> Tucson
>
> Hi,
>
> I include below the response of
> Prof. Stephen A. Kaufman, one of the world's most
> famous Aramaists, to the Everson Phoenician proposal:
>
> Dr. Stephen A. Kaufman wrote (on the ANE list
> recently):
>
> > Anyone who thinks there has to be a separate
> > encoding for Phoenician either does not understand
> > Unicode or (and probably "and") does not understand
> > what a glyph is. There are already encodings
> > suitable for all varieties of Northwest Semitic
> > scripts. One can legitimately argue, as some have,
> > that there are still some problems with the Hebrew
> > and Syriac encodings, but not that we need anything
> > more for the other NW Semitic languages other than
> > some nice FONTS!
> > Steve Kaufman
>
> Why did Debbie suggest June 7 as the latest date for
> responses?
>
> Elaine
RE: Response to Everson Phoenician and why June 7?
> > Anyone who thinks there has to be a separate
> > encoding for Phoenician either does not understand
> > Unicode or (and probably "and") does not understand
> > what a glyph is.

Was this meant to be a joke?

/|/|ike
Re: Response to Everson Phoenician and why June 7?
Golly gee, all this Phoenician talk just makes me wanna sing & dance! Yee-Haw!

Oh Lord let me flog yet another dead horse
I ain't got a life so I love it of course
Just hand me a whip and I will be so glad
So lord let me flog yet another dead horse!

Yer ol' pal,
Youtie

_
FREE pop-up blocking with the new MSN Toolbar - get it now! http://toolbar.msn.click-url.com/go/onm00200415ave/direct/01/
Re: Response to Everson Phoenician and why June 7?
At 13:54 -0700 2004-05-19, E. Keown wrote:

> I include below the response of Prof. Stephen A. Kaufman, one of the world's most famous Aramaists, to the Everson Phoenician proposal:

I had seen his contribution already.

> Anyone who thinks there has to be a separate encoding for Phoenician either does not understand Unicode or (and probably "and") does not understand what a glyph is.

I am not in the least bit chastened or chagrined by this.

> There are already encodings suitable for all varieties of Northwest Semitic scripts. One can legitimately argue, as some have, that there are still some problems with the Hebrew and Syriac encodings, but not that we need anything more for the other NW Semitic languages other than some nice FONTS!

Which would not address the plain-text requirement to distinguish the scripts qua scripts.
-- Michael Everson * * Everson Typography * * http://www.evertype.com
Response to Everson Phoenician and why June 7?
Elaine asked:

> Why did Debbie suggest June 7 as the latest date for
> responses?

Probably because that is the deadline for documents to be submitted for consideration at the upcoming UTC meeting. The issue will be discussed there, so anyone who wants to get their input into that meeting should do it soon.

Rick
ISO 15924 draft fixes
The Registrar wishes to thank everyone who has taken an interest in the ISO 15924 data pages, and regrets the imperfections which are contained there. I am not sure how we will manage the generation of the pages, but it is clear that the base should be the plain-text document. I have made changes to the plain-text document and placed it, a draft Changes page, and the original plain-text document available at http://www.unicode.org/iso15924/iso15924-fixes.zip I would appreciate it if interested persons could look this over and inform me if they find any further discrepancies between the two which are worth troubling about. Then we will proceed to generate the other files. I deleted some duplicate lines: Ethiopic was on two lines, under Ethiopic and under Ge'ez. It seemed inappropriate to burden the tables with such duplication. I added Coptic unilaterally. -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Vertical BIDI
Philippe Verdy recently said:

> From: <[EMAIL PROTECTED]>
> > What's uncertain is whether a lr or a rl progression is favored, given the
> > paucity of evidence. Michael favors lr progression. There is no question
> > that the text is read BTT.
>
> This creates an interesting problem: put in the same sentence Han (Chinese)
> and Mongolian words in a vertical layout (I don't think this is unlikely, as
> Mongolian is also spoken in China, and there's also a Chinese community in
> Mongolia). So Chinese ideographs will be laid out vertically from top to
> bottom (but not rotated, except for a few characters like ideographic
> punctuation marks or symbols), and Mongolian will be laid out from bottom to
> top in their normal stack orientation. Such a text is clearly bidirectional,
> so we would need BiDi processing to order glyphs correctly.

John's comment refers to Ogham. Mongolian goes top to bottom.

> Now try including some Latin words in this text (also not unlikely: there are
> lots of trademarks and people's names that will need to be written with their
> normal Latin characters). If the text is presented vertically, there's a
> legitimate question of whether Latin should be rotated (but it will keep the
> Han flow direction).

Latin and Cyrillic are rotated 90 degrees clockwise when mixed with Mongolian in vertical lines. Presumably Arabic would be rotated 90 degrees anti-clockwise. (The ancestor of Mongolian was which is why the vertical lines go left to right.) One amusing aspect is that punctuation like ? and ! stays vertical at the end of Mongolian sentences, but is rotated at the end of Latin and Cyrillic ones.

Mongolian is somewhat unusual in that nowadays, when it is written in horizontal lines, it is rotated a further 90 degrees so that it goes left to right and is upside down compared to the ancestral script.

Tim
--
Tim Partridge. Any opinions expressed are mine only and not those of my employer
Re: ISO 15924 codes for ConScript
On 2004.05.19, 06:23, Doug Ewell <[EMAIL PROTECTED]> wrote:

> For those who like ISO 15924 script codes and LOVE the Unicode
> Private Use Area -- you know who you are -- check out my list of
> proposed ISO 15924 private-use codes for the ConScript Unicode
> Registry:
>
> http://users.adelphia.net/~dewell/conscript-15924.html

Great, but wouldn't "Qaas" (918; Seussian Latin Extensions) rather be classified as Latn?

--. António MARTINS-Tuválkin | ()| <[EMAIL PROTECTED]>|| PT-1XXX-XXX LISBOA Não me invejo de quem tem| +351 934 821 700 carros, parelhas e montes| http://www.tuvalkin.web.pt/bandeira/ só me invejo de quem bebe| http://pagina.de/bandeiras/ a água em todas as fontes|
problems in Public Review 33 UTF Conversion Code Update
Looking at http://www.unicode.org/review/ :

  33  UTF Conversion Code Update  2004.06.08
  The C language source code example for UTF conversions (ConvertUTF.c)
  has been updated to version 1.2 and is being released for public review
  and comment. This update includes fixes for several minor bugs. The code
  can be found at the above link.

and looking at the code under http://www.unicode.org/Public/BETA/CVTUTF-1-2/ :

In http://www.unicode.org/Public/BETA/CVTUTF-1-2/ConvertUTF.c :

  /*
   * Index into the table below with the first byte of a UTF-8 sequence to
   * get the number of trailing bytes that are supposed to follow it.
   */
  static const char trailingBytesForUTF8[256] = {
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
      2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5
  };

Although there is code that prevents 5- and 6-byte UTF-8 sequences, the array above misleads people into thinking that 5- and 6-byte UTF-8 sequences exist. Also, F5-F7 should not map to 3, and C0 and C1 should not map to 1. It should be changed to:

  static const char trailingBytesForUTF8[256] = {
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
      2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,0,0,0,0,0,0,0,0,0,0,0
  };

Next:

  /*
   * Once the bits are split out into bytes of UTF-8, this is a mask OR-ed
   * into the first byte, depending on how many bytes follow. There are
   * as many entries in this table as there are UTF-8 sequence types.
   * (I.e., one byte sequence, two byte... six byte sequence.)
   */
  static const UTF8 firstByteMark[7] = { 0x00, 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };

This comment is also misleading ("six byte sequence"), and so are the 0xF8 and 0xFC entries.

  /* Figure out how many bytes the result will require */
  if (ch < (UTF32)0x80) {              bytesToWrite = 1;
  } else if (ch < (UTF32)0x800) {      bytesToWrite = 2;
  } else if (ch < (UTF32)0x10000) {    bytesToWrite = 3;
  } else if (ch < (UTF32)0x200000) {   bytesToWrite = 4;

Shouldn't the last line be

  } else if (ch < (UTF32)0x110000) {   bytesToWrite = 4;

? Where does the 0x200000 come from?

  switch (extraBytesToRead) {
      case 5: ch += *source++; ch <<= 6;
      case 4: ch += *source++; ch <<= 6;

This code also misleads people into thinking that there are 5- and 6-byte UTF-8 sequences.

Also, the following routine:

  static Boolean isLegalUTF8(const UTF8 *source, int length) {
      UTF8 a;
      const UTF8 *srcptr = source+length;
      switch (length) {
      default: return false;
      /* Everything else falls through when "true"... */
      case 4: if ((a = (*--srcptr)) < 0x80 || a > 0xBF) return false;
      case 3: if ((a = (*--srcptr)) < 0x80 || a > 0xBF) return false;
      case 2: if ((a = (*--srcptr)) > 0xBF) return false;
          switch (*source) {
              /* no fall-through in this inner switch */
              case 0xE0: if (a < 0xA0) return false; break;
              case 0xF0: if (a < 0x90) return false; break;
              case 0xF4: if (a > 0x8F) return false; break;
              default:   if (a < 0x80) return false;
          }
      case 1: if (*source >= 0x80 && *source < 0xC2) return false;
          if (*source > 0xF4) return false;
      }
      return true;
  }

does NOT match Table 3.1B as defined in Unicode 3.2 (see http://www.unicode.org/reports/tr28/#3_1_conformance) or Table 3-6, "Well-Formed UTF-8 Byte Sequences", on page 78 of Unicode 4.0. In particular, the function treats the following range as legal, while it should NOT:

  U+D800..U+DFFF   ED  A0-BF  80-BF

Also, in http://www.unicode.org/Public/BETA/CVTUTF-1-2/harness.c the following comment is misleading:

  /* ---------------------------------------------------------------------
      test01 - Spot check a few legal & illegal UTF-8 values only.
      This is not an exhaustive test, just a brief one that was used
      to develop the "isLegalUTF8" routine.

      Legal UTF-8 sequences are:

      1st    2nd    3rd    4th     Codepoints
      -----------------------------------------
      00-7F                        0000-007F
      C2-DF  80-BF                 0080-07FF
      E0     A0-BF  80-BF          0800-0FFF
      E1-EF  80-BF  80-BF          1000-FFFF
      F0     90-BF  80-BF  80-BF   10000-3FFFF
      F1-F3  80-BF  80-BF  80-BF   40000-FFFFF
      F4     80-8F  80-BF  80-BF   100000-10FFFF
      ----------------------------------------- */

It should be:

  Legal UTF-8 sequences are:

  1st    2nd    3rd    4th     Codepoints
  -----------------------------------------
  00-7F                        0000-007F
  C2-DF  80-BF                 0080-07FF
  E0     A0-BF  80-BF          0800-0FFF
  E1-EC  80-BF  80-BF          1000-CFFF
  ED     80-9F  80-BF          D000-D7FF
  EE-EF  80-BF  80-BF          E000-FFFF
  F0     90-BF  80-BF  80-BF   10000-3FFFF
  F1-F3  80-BF  80-BF  80-BF   40000-FFFFF
  F4     80-8F  80-BF  80-BF   100000-10FFFF
RSS newsfeed for Alan Wood's Unicode Resources
Until now, it has not been easy to find new entries for fonts and programs in my collection of Unicode resources, so I have implemented a newsfeed: http://www.alanwood.net/news/unicode.rss More information about the feed can be found at: http://www.alanwood.net/news/index.html I hope you will find it useful. Alan Wood http://www.alanwood.net (Unicode, special characters, pesticide names)
Re: Vertical BIDI
Andrew C. West scripsit: > The only thing that is certain is that Ogham must be rendered BTT in > vertical contexts. For Ogham text in isolation this is fairly easy to > accomplish by simple rotation, and one could expect "writing-mode > : bt-rl" or "writing-mode : bt-lr" to accomplish this in a CSS > stylesheet. Whether the columns should run LTR or RTL across the page > is another question, although LTR would be simplest to implement as > it would only involve rotating a whole block of horizontal LTR Ogham > text 90 degrees anticlockwise. At any rate, vertical presentation is > a matter for a higher protocol, and not a Unicode matter. I think it's clear by now that bt-lr is the Right Thing. (A great pity that the Irish monks didn't record horizontal Ogham RTL! If you are standing in front of an Ogham-inscribed archway, the curve of the text does pass from your right side to your left side (and the same for a standing stone if you in imagination flatten out the sides), and the monks must have had *some* familiarity with Hebrew or Arabic.) > However, Ogham text embedded in Mongolian may be a different matter. If > a plain text editor renders everything horizontally, as most do, then > both Mongolian and Ogham should be rendered LTR, but if you then select vertical presentation (assuming > your text editor has this option) Mongolian should be rendered TTB and > Ogham BTT. I still have no idea as > to how this should be achieved. My "hack" of using a custom rotated > Ogham font and RLO/PDF codes would achieve the desired result for > vertical presentation, but would make the Ogham text RTL for horizontal > presentation, which is apparently unacceptable. But what alternatives > are there? To introduce a concept of bidi override into stylesheet languages. You need something like this anyway to handle the case of lr Latin with embedded Han, where the Latin reads BTT and the Han reads TTB. 
Fundamentally, vertical scripts like Han and Mongolian and Ogham have an essential vertical directionality and a preferred horizontal one (but they can sometimes tolerate the other direction: RTL Han is not unknown). Horizontal scripts have an essential horizontal directionality and may or may not have a preferred vertical one. -- Long-short-short, long-short-short / Dactyls in dimeter, Verse form with choriambs / (Masculine rhyme): [EMAIL PROTECTED] One sentence (two stanzas) / Hexasyllabically http://www.reutershealth.com Challenges poets who / Don't have the time. --robison who's at texas dot net
Re: Vertical BIDI
From: "John Cowan" <[EMAIL PROTECTED]> > The difficulty arises when Ogham is mixed with vertical Han or with > Mongolian, since once the basic directionality becomes vertical, the > tendency to read the Ogham BTT will become automatic. This is analogous > to the problem that fantasai has pointed out with Latin script written > in lr progression when Han gets mixed in: the normal reading direction > of lr-Latin is BTT, but any Han included will automatically be read TTB, > corrupting it. "Corrupting" is probably a bad term here. Vertical Latin text is _often_ written by rotating it 90 degrees counterclockwise (the same rotation direction used for angled presentation at 45 degrees, commonly found in the header row of tables with many narrow columns), so that it reads bottom to top. But clockwise rotation is also possible (commonly found in the footer row of such tables). For Latin, the rotation of the baseline is a matter of style. In Han or Kana texts, occurrences of Latin can run in either direction, but with different baseline orientations. Less often (?), the baseline of the Latin glyphs is not rotated; instead the glyphs are stacked one below the other, as in crosswords (this happens mostly in uppercase-only styles, since the arrangement is horrible with lowercase letters). This presentation would be consistent with traditional vertical Han presentation (where glyphs keep their horizontal baseline and are not rotated); it may be ideal for small inclusions of Latin in Han text, but it is ill-suited to the cursive handwritten form, where the writer would probably turn the paper 90 degrees counterclockwise to write. Latin is quite permissive about the rotation of its glyphs, because readers can determine the baseline orientation easily and without ambiguity. This is not true of Ogham, where you need to know the language to see in which direction the characters must be read and interpreted.
Re: Vertical BIDI
Michael Everson wrote: > > Come on, people. Read the standard, please. It's on page 338. Michael is absolutely right to rebuke me for not reading the Standard. Of course I have read the Ogham block intro before, and no doubt that is where I got the notion of rendering Ogham BTT from, but I had forgotten that Ogham's BTT directionality is explicitly mentioned there. If only I had reread the block intro before joining this thread I wouldn't have ended up rambling down a dead end in my recent postings. But now that I'm back on the marked path the way forward is still as unclear as ever. The only thing that is certain is that Ogham must be rendered BTT in vertical contexts. For Ogham text in isolation this is fairly easy to accomplish by simple rotation, and one could expect "writing-mode : bt-rl" or "writing-mode : bt-lr" to accomplish this in a CSS stylesheet. Whether the columns should run LTR or RTL across the page is another question, although LTR would be simplest to implement as it would only involve rotating a whole block of horizontal LTR Ogham text 90 degrees anticlockwise. At any rate, vertical presentation is a matter for a higher protocol, and not a Unicode matter. However, Ogham text embedded in Mongolian may be a different matter. If a plain text editor renders everything horizontally, as most do, then both Mongolian and Ogham should be rendered LTR, but if you then select vertical presentation (assuming your text editor has this option) Mongolian should be rendered TTB and Ogham BTT. I still have no idea as to how this should be achieved. My "hack" of using a custom rotated Ogham font and RLO/PDF codes would achieve the desired result for vertical presentation, but would make the Ogham text RTL for horizontal presentation, which is apparently unacceptable. But what alternatives are there? Andrew
Re: Vertical BIDI
Philippe Verdy scripsit: > > In fact no; both Mongolian (or Manchu, which is unified with it in > > Unicode) and Chinese are written TTB. > > Then, why did you say that: > > > What's uncertain is whether a lr or a rl progression is favored, > > given the paucity of evidence. Michael favors lr progression. > > There is no question that the text is read BTT. That statement refers to Ogham, not Mongolian! Ogham carved on stone is read up one side of the stone, then (if necessary) across the top of the stone, then (if necessary) down the other side of the stone. Now maybe it's just a mistake to assimilate this scheme to any kind of two-dimensional layout, since all known instances of Ogham on manuscript are ordinary horizontal L2R, like Latin (with which it is most often mixed). The difficulty arises when Ogham is mixed with vertical Han or with Mongolian, since once the basic directionality becomes vertical, the tendency to read the Ogham BTT will become automatic. This is analogous to the problem that fantasai has pointed out with Latin script written in lr progression when Han gets mixed in: the normal reading direction of lr-Latin is BTT, but any Han included will automatically be read TTB, corrupting it. *sigh* One of my favorite lines in the Unicode Standard reads: "There simply is no traditional Japanese method of typesetting Devanagari." -- John Cowan www.ccil.org/~cowan www.reutershealth.com [EMAIL PROTECTED] There are books that are at once excellent and boring. Those that at once leap to the mind are Thoreau's Walden, Emerson's Essays, George Eliot's Adam Bede, and Landor's Dialogues. --Somerset Maugham