Re: U+0140
From: John Hudson [EMAIL PROTECTED]

> 'Careful hairsplitting' always takes place when people care about typography.

How very true. On the one hand, there are people who put a cedilla under "a" when typesetting Polish; on the other hand, there are people who adjust the vertical position of hyphens when typesetting all-caps. And there is a lot in between. But it is important to realize that there _always_ were people who adjusted the hyphen in all-caps settings. Gutenberg's own typesetting was careful hairsplitting.

This is a very typical and essential dilemma, and it is one of the reasons why there is no easy answer to the glyph vs. character question, or more precisely, why the character definition in Unicode is so, well, vague: the decision on what is a character and what is merely a glyph variant is made somewhat arbitrarily (albeit in a committee process). There are far too many exceptions to the rule for Unicode to be consistent and easy to use. But since written human language never was consistent and easy to use, I guess it's something very natural and we will all live with that.

Adam
Downloading UCD 4.0.0
Hi,

Until now I always downloaded the latest version of the UCD and worked with that. Now I want to download the UCD files for 4.0.0 again. I know it is all in http://www.unicode.org/Public/4.0-Update/, but in http://www.unicode.org/ucd/ I read this:

> The complete set of all files for a given version of the UCD consists of the files in the update directory for that version, together with all the files unchanged from earlier versions, which are kept in their respective update directories.

Do I really need to find out and download all unchanged files from 3.2.0 and earlier, just to get the files for 4.0.0?

Theo
JIS X 0213: 2000 AMD-1 and Unihan.txt
Would it be reasonable to expect that data concerning the ten characters added to JIS X 0213 by Amendment 1 will make it into the next version of Unihan.txt? I'm presuming that this is official since ISO-IR-233, which updates ISO-IR-228, was released on 13 April.

[Relevant data from ISO-IR-233]

Unicode = Min,Ku,Ten
U+4FF1  = 1,14,01
U+525D  = 1,15,94
U+541E  = 1,47,94
U+5653  = 1,84,07
U+59F8  = 1,94,90
U+5C5B  = 1,94,91
U+5E77  = 1,94,92
U+7626  = 1,94,93
U+7E6B  = 1,94,94
U+20B9F = 1,47,52

Ernest Cline [EMAIL PROTECTED]
Re: Downloading UCD 4.0.0
Theo Veenker asked:

> Until now I always downloaded the latest version of the UCD and worked with that. Now I want to download the UCD files for 4.0.0 again. I know it is all in http://www.unicode.org/Public/4.0-Update/,

That is an incorrect assumption.

> but in http://www.unicode.org/ucd/ I read this:
>
> The complete set of all files for a given version of the UCD consists of the files in the update directory for that version, together with all the files unchanged from earlier versions, which are kept in their respective update directories.
>
> Do I really need to find out and download all unchanged files from 3.2.0 and earlier, just to get the files for 4.0.0?

Yes. The relevant information for *each* version of the Unicode Standard is at:

http://www.unicode.org/standard/Versions/enumeratedversions.html

As it happens, almost *every* data file was updated for Unicode 4.0, so almost everything is available specifically in

http://www.unicode.org/Public/4.0-Update/

The only normative files that were unchanged from an earlier version were:

http://www.unicode.org/Public/3.2-Update/Jamo-3.2.0.txt
http://www.unicode.org/Public/3.2-Update/Unihan-3.2.0.zip

Of course, the update for Unihan.txt was one of the main reasons for the Unicode 4.0.1 release.

The only other file that was unchanged was the character index to the book:

http://www.unicode.org/Public/3.2-Update/Index-3.2.0.txt

which, for production reasons, was not updated again until the release of Unicode 4.0.1.

--Ken
Re: Downloading UCD 4.0.0
At 08:42 AM 4/19/2004, Theo Veenker wrote:

> Hi,
>
> Until now I always downloaded the latest version of the UCD and worked with that. Now I want to download the UCD files for 4.0.0 again. I know it is all in http://www.unicode.org/Public/4.0-Update/, but in http://www.unicode.org/ucd/ I read this:
>
> The complete set of all files for a given version of the UCD consists of the files in the update directory for that version, together with all the files unchanged from earlier versions, which are kept in their respective update directories.
>
> Do I really need to find out and download all unchanged files from 3.2.0 and earlier, just to get the files for 4.0.0?

Yes. And depending on which version of the UCD you are trying to piece together, you may need versions of some files from several earlier updates.

A./

PS: we are looking into ways to make access to older versions more straightforward.
FW: Web Form: Subj: Unicode conversion- Microsoft Visual C++ compiler
Mino,

I am sending your question to the Unicode public email list http://www.unicode.org/consortium/distlist.html for a possible answer from one of the list subscribers.

Regards,

Magda Danish
Sr. Administrative Director
The Unicode Consortium
650-693-3921
[EMAIL PROTECTED]

-----Original Message-----
Date/Time: Mon Apr 19 05:09:20 EDT 2004
Contact: [EMAIL PROTECTED]
Report Type: Other Question, Problem, or Feedback
Opt Subject: Unicode conversion

I would like to convert a 2 byte Unicode code into its corresponding Unicode character (for instance the decimal 1063 or the hexadecimal 0427 into 'Ч'). Is there a C function in order to make the conversion? What file .h do I need to include in the C program? Can I use the 6.0 version of the Microsoft Visual C++ compiler, or do I need a newer version? Thanks a lot in advance.

Mino Napoletano

(End of Report)
Re: JIS X 0213: 2000 AMD-1 and Unihan.txt
Yes, it's reasonable. In fact, the data have already been added, but this was done just too late for inclusion in the 4.0.1 release.

On Apr 19, 2004, at 12:23 PM, Ernest Cline wrote:

> Would it be reasonable to expect that data concerning the ten characters added to JIS X 0213 by Amendment 1 will make it into the next version of Unihan.txt? I'm presuming that this is official since ISO-IR-233, which updates ISO-IR-228, was released on 13 April.
>
> [Relevant data from ISO-IR-233]
>
> Unicode = Min,Ku,Ten
> U+4FF1  = 1,14,01
> U+525D  = 1,15,94
> U+541E  = 1,47,94
> U+5653  = 1,84,07
> U+59F8  = 1,94,90
> U+5C5B  = 1,94,91
> U+5E77  = 1,94,92
> U+7626  = 1,94,93
> U+7E6B  = 1,94,94
> U+20B9F = 1,47,52
>
> Ernest Cline [EMAIL PROTECTED]

John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jhjenkins/
Re: Web Form: Subj: Unicode conversion- Microsoft Visual C++ compiler
Mino,

This is not at all clear: the character U+0427 is Ч in the Cyrillic block, and what does this have to do with the two characters Ð and §, which are U+00D0 and U+00A7? Are you wondering how to store 0x0427 in a binary file? Or what?

Raymond Mercier

> Contact: [EMAIL PROTECTED]
> Report Type: Other Question, Problem, or Feedback
> Opt Subject: Unicode conversion
>
> I would like to convert a 2 byte Unicode code into its corresponding Unicode character (for instance the decimal 1063 or the hexadecimal 0427 into 'Ч'). Is there a C function in order to make the conversion? What file .h do I need to include in the C program? Can I use the 6.0 version of the Microsoft Visual C++ compiler, or do I need a newer version? Thanks a lot in advance.
>
> Mino Napoletano
>
> (End of Report)
Re: U+0140
John Hudson responded to Michael Everson:

> Michael Everson wrote:
>
>>> This would make the mid-dot too high. The top dot of the colon usually sits toward the top of the x-height; the *mid*-dot should sit lower,
>>
>> John, I just don't believe you. I don't believe that in all the history of Greek and Catalan typography this careful hairsplitting has *always* taken place; certainly in scientific transcription the HALF TRIANGULAR COLON is just the top dot in the TRIANGULAR COLON, and in Americanist transcription where the dot-colons are used instead of triangles I would say the same applies.
>
> I never contested that the dots of a colon correspond to the triangles of the linguistic long vowel marker. They clearly do. What I contested was that the typographic mid-point (U+00B7) corresponded to the top dot of a colon. It clearly does not. It is called a mid-point because it sits midway up the x-height. It is used in this position for a variety of stylistic purposes, ...

I think we have two typographers here arguing somewhat at cross-purposes.

Clearly the typographic mid-point behaves as John has mentioned, and is designed as such in many fine fonts (examples seen among the exhibits that Asmus gathered).

But just as clearly, there is a long, long tradition in Americanist orthographic practice (which is used widely for linguistic orthographies outside of Native America as well) of using a raised dot for an indication of vocalic (and occasionally consonantal) length. For 100 years, that raised dot was mechanically generated by, among other means, filing the lower dot off a colon key on a mechanical typewriter. (I have such a typewriter sitting on my desk.)

Linguists got used to this raised dot height, coordinated with a colon in design (which then could be used, among other things, to indicate a prolonged length, when two degrees of length were in question), and that preference made its way into print, at least for many North American languages, where the raised dot could be printed at x-height, rather than at midway up the x-height, which would be too low for most of the linguistic usage.

Enter the electronic age.

ASCII had no MIDDLE DOT. It was period (.), colon (:) or the highway. Early linguistic material on computers made do with those, because they had no choice.

The IBM PC and the Macintosh introduced a MIDDLE DOT (0xFA [= IBM CDRA SD63 Middle Dot] and 0xE1, respectively). When ISO 8859-1 was defined, it also had a MIDDLE DOT (0xB7).

*Everybody* made use of that MIDDLE DOT for anything that was vaguely in the ballpark -- the typographical mid-point, the linguistic length mark, the mathematical multiplication operator, the Greek ano teleia, the dictionary hyphenation point, and, yes, the Catalan middle dot. The fact that each of those usages might have extremely fine typographical hairs to split regarding the rendering was so much horsepucky as far as the character identity was concerned. You used what you had available to represent your data.

The Unicode Standard, for a variety of reasons -- some of which included compatibility mapping concerns to other standards which had started to proliferate middle dots -- added a collection of middle dots *besides* U+00B7, *the* middle dot, to its repertoire.

Those other middle dots give people textual representation alternatives now, if they need to make distinctions, and textual rendering alternatives, if they need to make middle dots which display with slightly different heights, sizes, or spacings, depending on the rendering requirements.

What is clear, however, is that it is utterly impossible to satisfy everybody regarding middle dots.

Typographical purists will always want plain text to make more distinctions.

Text processing requirements will abhor the splitting of text representation into more and more difficult-to-distinguish glyph representations without clear semantic differences.

And dot proliferation *always* poses difficulty for establishing character properties.

Before people bluster on too much further on this thread, it would be good for everyone to recall that the *reason* why U+00B7 has problematical properties is that it was inherently ambiguous in *preexisting* usage (that is, prior to Unicode altogether) as punctuation versus length mark (and other things as well). This puts it in the same grab bag of very difficult, ambiguous ASCII characters, such as ~, *, and ', which also acquired conflicting usages during their reign among the small set of available punctuation and symbols in ASCII.

History has consequences. The history of a character's encoding also has consequences for how the Unicode Standard is to be used and interpreted.

--Ken
Re: Web Form: Subj: Unicode conversion- Microsoft Visual C++ compiler
I think this was just a confused way of asking how to convert UTF-16 into UTF-8:

U+0427 is the Unicode encoded character.
0x0427 is the UTF-16 character encoding form for it.
0xD0 0xA7 is the UTF-8 character encoding form for it.

Mino, sample code for how to do this is available at:

http://www.unicode.org/Public/PROGRAMS/CVTUTF/

Many Unicode support libraries will have a UTF-16 <--> UTF-8 conversion routine built in somewhere. Check in the documentation of the libraries for details. This isn't a standard C function call -- it is in the libraries.

--Ken

> Mino,
>
> This is not at all clear: the character U+0427 is Ч in the Cyrillic block, and what does this have to do with the two characters Ð and §, which are U+00D0 and U+00A7? Are you wondering how to store 0x0427 in a binary file? Or what?
>
> Raymond Mercier
>
>> Contact: [EMAIL PROTECTED]
>> Report Type: Other Question, Problem, or Feedback
>> Opt Subject: Unicode conversion
>>
>> I would like to convert a 2 byte Unicode code into its corresponding Unicode character (for instance the decimal 1063 or the hexadecimal 0427 into 'Ч'). Is there a C function in order to make the conversion? What file .h do I need to include in the C program? Can I use the 6.0 version of the Microsoft Visual C++ compiler, or do I need a newer version? Thanks a lot in advance.
>>
>> Mino Napoletano
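The mapping Ken describes (U+0427 becoming the two bytes 0xD0 0xA7) can be sketched in a few lines of C. This is only a minimal illustration of the UTF-8 encoding form, not the CVTUTF sample code linked above; the function name utf8_encode is hypothetical, and plain unsigned long / unsigned char are used on the assumption that an older compiler without <stdint.h> may be in play.

```c
/* Encode one Unicode scalar value as UTF-8.
 * Writes 1 to 4 bytes into out and returns the byte count,
 * or 0 if cp is a surrogate code point or lies above U+10FFFF.
 * (utf8_encode is a hypothetical name used for illustration.) */
int utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80UL) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800UL) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800UL && cp <= 0xDFFFUL)    /* surrogates are not characters */
        return 0;
    if (cp < 0x10000UL) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFFUL) {                  /* 11110xxx followed by three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                                /* beyond the Unicode code space */
}
```

Calling utf8_encode(0x0427, buf) fills buf with 0xD0 0xA7 and returns 2, which matches the two bytes that showed up as Ð and § when the mail was read in a Latin-1 view.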
Re: U+0140
On 19/04/2004 13:03, Kenneth Whistler wrote:

> ... Those other middle dots give people textual representation alternatives now, if they need to make distinctions, and textual rendering alternatives, if they need to make middle dots which display with slightly different heights, sizes, or spacings, depending on the rendering requirements.

Ken, does Unicode specify height, size and spacing distinctions between the various middle dots which you listed? If I understand correctly, it certainly doesn't do so exhaustively. So in effect what you are suggesting here is that people make and use their own private distinctions between characters which are not defined by Unicode. This sounds very like advising people to ignore Unicode character identities and properties and do their own thing. Rather strange advice from someone in your position, surely?

Surely, in the current situation and if further proliferation of middle dots is considered undesirable, users should be advised to presume that distinctions between middle dots are not a plain text matter and so should be handled by markup, including language selection. And if (as I just suggested on the Hebrew list might be true of some variant Hebrew pointing systems) someone finds a well documented script in which a true middle dot and an x-height dot are used contrastively, the correct approach would be either to accept, reluctantly, that at least one new dot needs to be encoded; or else for Unicode to define clearly which existing character should be used for which dot in this script.

The worst thing that could happen would be for different text providers to make different and incompatible selections among the existing characters, leading to total confusion. But that seems to be the approach which you, Ken, are advocating.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/
RE: Web Form: Subj: Unicode conversion- Microsoft Visual C++ compiler
It may be even simpler than that: U+0427 may have appeared in his message in UTF-8 because of his mail client. It could be that he's asking how to convert from an int holding the number 1063 to a wchar_t holding U+0427. The answer to this question is:

    int charValue = 1063;
    wchar_t utf16Char = (wchar_t)charValue;

Cheers
- rick

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Kenneth Whistler
Sent: April 19, 2004 13:47
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: Re: Web Form: Subj: Unicode conversion- Microsoft Visual C++ compiler

I think this was just a confused way of asking how to convert UTF-16 into UTF-8:

U+0427 is the Unicode encoded character.
0x0427 is the UTF-16 character encoding form for it.
0xD0 0xA7 is the UTF-8 character encoding form for it.

Mino, sample code for how to do this is available at:

http://www.unicode.org/Public/PROGRAMS/CVTUTF/

Many Unicode support libraries will have a UTF-16 <--> UTF-8 conversion routine built in somewhere. Check in the documentation of the libraries for details. This isn't a standard C function call -- it is in the libraries.

--Ken

> Mino,
>
> This is not at all clear: the character U+0427 is Ч in the Cyrillic block, and what does this have to do with the two characters Ð and §, which are U+00D0 and U+00A7? Are you wondering how to store 0x0427 in a binary file? Or what?
>
> Raymond Mercier
>
>> Contact: [EMAIL PROTECTED]
>> Report Type: Other Question, Problem, or Feedback
>> Opt Subject: Unicode conversion
>>
>> I would like to convert a 2 byte Unicode code into its corresponding Unicode character (for instance the decimal 1063 or the hexadecimal 0427 into 'Ч'). Is there a C function in order to make the conversion? What file .h do I need to include in the C program? Can I use the 6.0 version of the Microsoft Visual C++ compiler, or do I need a newer version? Thanks a lot in advance.
>>
>> Mino Napoletano
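The cast shown above is enough for U+0427, but only because that code point fits in a single 16-bit unit. As a hedged sketch of the general case, the following function produces the UTF-16 code units for any code point, using a surrogate pair above U+FFFF. The name utf16_encode is hypothetical, and unsigned short stands in for a 16-bit wchar_t, which is an assumption about the Windows compilers discussed in this thread rather than something the C standard guarantees.

```c
/* Encode one Unicode scalar value as UTF-16 code units.
 * Writes 1 or 2 units into out and returns the unit count,
 * or 0 if cp is a surrogate code point or lies above U+10FFFF.
 * (utf16_encode is a hypothetical name used for illustration.) */
int utf16_encode(unsigned long cp, unsigned short out[2])
{
    unsigned long v;

    if (cp >= 0xD800UL && cp <= 0xDFFFUL)
        return 0;                            /* lone surrogates are invalid */
    if (cp < 0x10000UL) {
        out[0] = (unsigned short)cp;         /* BMP: the value is the code unit */
        return 1;
    }
    if (cp > 0x10FFFFUL)
        return 0;                            /* beyond the Unicode code space */

    v = cp - 0x10000UL;                      /* 20 bits split across two units */
    out[0] = (unsigned short)(0xD800UL + (v >> 10));     /* high surrogate */
    out[1] = (unsigned short)(0xDC00UL + (v & 0x3FFUL)); /* low surrogate */
    return 2;
}
```

For 1063 (0x0427) this returns one unit, 0x0427, the same value the cast yields. For a supplementary character such as U+20B9F it returns the pair 0xD842 0xDF9F, which a single 16-bit cast cannot represent.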
RE: U+0140
> And if... someone finds a well documented script in which a true middle dot and an x-height dot are used contrastively,

That would be a somewhat surprising and not-to-be-recommended design for a writing system. Not to be completely ruled out, though. But we can probably wait to cross that encoding bridge when we come to it.

Peter

Peter Constable
Globalization Infrastructure and Font Technologies
Microsoft Windows Division
Re: U+0140
Peter Kirk continued this...

> On 19/04/2004 13:03, Kenneth Whistler wrote:
>
>> ... Those other middle dots give people textual representation alternatives now, if they need to make distinctions, and textual rendering alternatives, if they need to make middle dots which display with slightly different heights, sizes, or spacings, depending on the rendering requirements.
>
> Ken, does Unicode specify height, size and spacing distinctions between the various middle dots which you listed?

No.

> If I understand correctly, it certainly doesn't do so exhaustively.

Correct.

> So in effect what you are suggesting here is that people make and use their own private distinctions between characters which are not defined by Unicode.

Not at all. I am suggesting that people who use Unicode characters *will* use them according to their identity. However, that doesn't mean that identification of a character neatly solves all issues of their rendering, nor will it automatically make things neat and tidy when people use characters in different contexts which may have different rendering concerns.

The Unicode Standard is not prescriptive about rendering, beyond the basics required to simply ensure correct mapping of textual content into streams of characters. If one font vendor wants to have a raised glyph for the MIDDLE DOT and another wants to have a lowered glyph for the same character, it is not the Unicode Standard's business to put the two vendors in a room until one gives up and admits the other one is correct.

> This sounds very like advising people to ignore Unicode character identities and properties and do their own thing. Rather strange advice from someone in your position, surely?

I love the way you put positions in people's mouths.

By the way, I challenge you to point to the Unicode character properties in the Unicode Character Database which define the relative position for middle dots with respect to x-height of a font, or the spacing of middle dots, for example.

> Surely, in the current situation and if further proliferation of middle dots is considered undesirable,

It is undesirable, yes.

> users should be advised to presume that distinctions between middle dots are not a plain text matter

No, they should not. Because the existence of multiple different middle dots in the standard which are *not* canonical equivalents of each other makes it a plain text matter.

> and so should be handled by markup, including language selection.

In some cases, yes -- it depends on the effect which is intended, and the context and application it occurs in.

> And if (as I just suggested on the Hebrew list might be true of some variant Hebrew pointing systems) someone finds a well documented script in which a true middle dot and an x-height dot are used contrastively, the correct approach would be either to accept, reluctantly, that at least one new dot needs to be encoded; or else for Unicode to define clearly which existing character should be used for which dot in this script.

Or: None of the Above

The users of characters for particular domains bear their own responsibility to define their usage. It is not up to the Unicode Consortium to go around defining everyone's spelling rules and orthographic conventions for them.

If there are things unclear in the standard which are making its use difficult for people in certain cases, then that is certainly a concern of the Unicode Technical Committee. And if someone brings in convincing evidence of the existence of a semantically significant plain text distinction between two dots that cannot plausibly be handled by *any* combination of the multitudinous dot characters already present in the standard, then the UTC might consider that sufficient justification to encode yet another middle dot.

Given, however, the fact that there already are so many dot characters, and given that their rendering often varies by font, the chance of getting some additional pair of dot distinctions by height on the line canonized with yet another dot encoding seems unlikely to me.

It is a will-o'-the-wisp to expect any and all multilingual Unicode text to display correctly to any arbitrary n-th degree of typographical rectitude with any and all Unicode-conformant fonts. The use of specific fonts with specific designs is *precisely* to enable plain text (or marked-up text, for that matter) to be displayed as desired for particular contexts. The criterion for Unicode plain text is basically *legible* text.

> The worst thing that could happen would be for different text providers to make different and incompatible selections among the existing characters, leading to total confusion. But that seems to be the approach which you, Ken, are advocating.

I see. And thank you, Peter, for pointing that error out to me.

Text providers have their own responsibility to ensure that they are using interoperable conventions for the representation of
Re: Diacritic Property and Philippine Viramas
Ernest Cline asked:

> Is there a reason for the lack of the Diacritic property on the Tagalog and Hanunoo virama characters (U+1714 and U+1734)?

Human fallibility?

> All of the other virama characters (i.e., those of combining class 9) have this property and it seems appropriate based on the description of these characters in Chapter 10.

I think you are correct.

--Ken

> Ernest Cline [EMAIL PROTECTED]
Re: U+0140
Peter Constable wrote:

>> And if... someone finds a well documented script in which a true middle dot and an x-height dot are used contrastively,
>
> That would be a somewhat surprising and not-to-be-recommended design for a writing system. Not to be completely ruled out, though. But we can probably wait to cross that encoding bridge when we come to it.

We already have contrasted use of a baseline dot (period or full stop) and a mid-dot (word separator or stylistic hyphen), so why would you be surprised by contrasted use of mid-dot and x-height dot? Vertical alignment is clearly sometimes a semantic feature.

I've seen plenty of business cards in which the mid-dot is used as a stylistic division between parts of a telephone number instead of spaces, periods or hyphens. I don't like the style, but people do it. Presumably some Greek people do it also, in which case they are contrasting the mid-dot and the ano teleia.

John Hudson

--
Tiro Typeworks www.tiro.com
Vancouver, BC [EMAIL PROTECTED]

I often play against man, God says, but it is he who wants to lose, the idiot, and it is I who want him to win. And I succeed sometimes In making him win. - Charles Peguy
Re: Downloading UCD 4.0.0
Theo Veenker <Theo dot Veenker at let dot uu dot nl> wrote:

> Until now I always downloaded the latest version of the UCD and worked with that. Now I want to download the UCD files for 4.0.0 again. I know it is all in http://www.unicode.org/Public/4.0-Update/, ... Do I really need to find out and download all unchanged files from 3.2.0 and earlier, just to get the files for 4.0.0?

and Kenneth Whistler <kenw at sybase dot com> responded:

> Yes. The relevant information for *each* version of the Unicode Standard is at:
>
> http://www.unicode.org/standard/Versions/enumeratedversions.html
> ...

I think the answer depends on what Theo really wants. He asked about downloading the data files for 4.0.0, but before that he mentioned downloading the latest version, which is not 4.0.0 but 4.0.1.

If Theo really wants the 4.0.0 data files, he needs to download not only from 4.0-Update but also from 3.2-Update, as Ken said. If all he wants is the latest version (4.0.1), he can go to:

http://www.unicode.org/Public/UNIDATA/

which not only has all the files, but has the added advantage that he doesn't have to strip the -x.x.x version number from the file names if he's only interested in replacing old files with new ones.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/
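For anyone who does assemble a version from the update directories, the renaming step mentioned above (dropping the -x.x.x suffix, so that Jamo-3.2.0.txt becomes Jamo.txt) might be sketched as follows. The function name strip_ucd_version and its return convention are hypothetical, and the sketch assumes the version always sits between the last hyphen and the file extension, as in the file names quoted in this thread.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Copy name into out, dropping a "-x.y.z" version suffix that sits
 * immediately before the file extension.
 * Returns 1 if a suffix was stripped, 0 if the name was copied
 * unchanged, and -1 if out is too small.
 * (strip_ucd_version is a hypothetical helper for illustration.) */
int strip_ucd_version(const char *name, char *out, size_t outsize)
{
    const char *dash = strrchr(name, '-');
    const char *ext = NULL;
    size_t stem, extlen;

    if (dash != NULL && isdigit((unsigned char)dash[1])) {
        const char *p = dash + 1;
        /* Skip the run of digits and dots that follows the hyphen. */
        while (isdigit((unsigned char)*p) || *p == '.')
            p++;
        /* The run must end with the dot that begins the extension. */
        if (*p != '\0' && p[-1] == '.')
            ext = p - 1;
    }

    if (ext == NULL) {                       /* no version suffix found */
        if (strlen(name) + 1 > outsize)
            return -1;
        strcpy(out, name);
        return 0;
    }

    stem = (size_t)(dash - name);            /* length of the part before '-' */
    extlen = strlen(ext);                    /* ".txt", ".zip", ... */
    if (stem + extlen + 1 > outsize)
        return -1;
    memcpy(out, name, stem);
    memcpy(out + stem, ext, extlen + 1);     /* includes the terminating NUL */
    return 1;
}
```

Names without a version suffix, such as those served from the UNIDATA directory, pass through unchanged, so the same helper can be applied to every file regardless of where it came from.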
Unihan.txt and the four dictionary sorting algorithm
While I would expect the answer to my question to be yes, one never knows what lurks in the heart of data files.

Unihan.txt contains at least two properties for each of the four dictionaries used in the sorting algorithm. One property contains only characters that are actually in the dictionary, while the other contains interpolations as well. Is it always the case that a character is in one of these dictionaries if and only if the two properties have the same value and that value ends in 0? For example, if there is a value of kIRGKangXi of the form .YY0, will there always be the same value for kKangXi for that character, and vice versa?

I'm trying to pare Unihan.txt down to a less unwieldy size for my own use by eliminating properties that are of no interest to me. I would like to be certain that the four properties containing the actual values for those dictionaries can be eliminated safely, because the information can be reconstituted if necessary from the kIRG* properties, since I'm not certain whether those properties are of interest to me.

Ernest Cline [EMAIL PROTECTED]
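One way to test the hypothesis in this question before deleting anything would be to run a per-character check over the data. The sketch below covers only that check, under stated assumptions: each property holds a single value string for the character, a missing property is passed as NULL, and the helper names ends_in_zero and pair_is_consistent are hypothetical. Reading Unihan.txt itself is left out.

```c
#include <stddef.h>
#include <string.h>

/* True when a dictionary position string ends in the digit 0,
 * which the question takes to mean "actually in the dictionary". */
int ends_in_zero(const char *value)
{
    size_t n = strlen(value);
    return n > 0 && value[n - 1] == '0';
}

/* Check one character against the hypothesis in the question:
 * the plain dictionary property (e.g. kKangXi) is present if and
 * only if the kIRG* property ends in 0, and then both values match.
 * Pass NULL for a property the character does not have.
 * Returns 1 if the pair fits the hypothesis, 0 if it contradicts it. */
int pair_is_consistent(const char *irg_value, const char *dict_value)
{
    if (dict_value == NULL)
        return irg_value == NULL || !ends_in_zero(irg_value);
    return irg_value != NULL
        && ends_in_zero(irg_value)
        && strcmp(irg_value, dict_value) == 0;
}
```

If every character in the file passes this check for all four dictionaries, the plain properties can be rebuilt from the kIRG* ones by keeping exactly the values that end in 0. A single failure would mean the plain properties carry information of their own.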
Re: Downloading UCD 4.0.0
I wrote:

> I think the answer depends on what Theo really wants. He asked about downloading the data files for 4.0.0, but before that he mentioned downloading the latest version, which is not 4.0.0 but 4.0.1.

Reading Theo's question again, I see that he was talking about having downloaded the latest version until now and now wants to download 4.0.0 again, which he recognizes is not the latest version. So Ken's answer was the appropriate one.

Read first, Doug, then write.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/