Re: Private Use proposals (long)
Michael Everson (everson at evertype dot com) wrote:

> At 08:51 -0700 2002-05-21, Doug Ewell wrote:
>> (Deseret and Shavian were encoded in ConScript; whether that helped get them into Unicode or not, I don't know.)
>
> Certainly not. They were examined on their merits just like anything else.

Of course they were. By "helped" I didn't mean that the characters wouldn't otherwise have been worthy of encoding, but that the CSUR assignments might have resulted in additional usage, which in turn got the attention of UTC and/or WG2.

I'm trying to examine the passage in TUS 3.0, Section 13.5 (p. 323) which seems to have caught Mr. Overington's fancy:

> Promotion of Private-Use Characters. In future versions of the Unicode Standard, some characters that have been defined by one vendor or another in the Corporate Use subarea may be encoded elsewhere as regular Unicode characters if their usage is widespread enough that they become candidates for general use. The code positions in the Private Use Area are permanently reserved for private use -- no assignment to a particular set of characters will ever be endorsed by the Unicode Consortium.

Ignoring the last sentence, because we all seem to be on board with that, I think the image of the PUA that may have emerged from this is that of a test bed for proposed characters. In this scenario, characters are encoded in the PUA *so that* they will gain increased usage, *so that* the UTC will take note of the increased usage and respond by promoting the character to Unicode. (I think the use of the word "promotion" in the 13.5 subhead is turning out to be a bad idea, as it implies a simple and straightforward progression.)

As I mentioned earlier, as far as I know no script or character has followed this path deliberately -- that is, been encoded in the PUA for the express purpose of satisfying Unicode's widespread-usage requirement. Of course, we all know (don't we?) that a script or character must satisfy many other criteria as well.
Deseret and Shavian obviously did satisfy those criteria, as well as being judged to have sufficiently widespread usage. Those additional criteria -- not frequency of usage -- are what will prevent additional Latin ligatures from being promoted to Unicode.

To answer (I hope) some of William's other points:

> Well, the ideas are not intended to be quasi-official. Just one end user of the Unicode system seeking to use the Private Use Area to good effect and putting forward ideas to other end users who might like to consider using some of the facilities suggested.

Hooray for that. The PUA is there for just that purpose. However, in the spirit of using Unicode, please also respect the character-glyph model, which says (among other things) that a ligature is a glyph, supplied by the font and rendering system, not a character requiring a code point.

> Now, the fact is that Michael suggested a feature named ZERO WIDTH LIGATOR specifically for the purpose of ligation, and it appears that that suggestion has not been accepted, but that a shared solution with a code point that can also mean something else has been decided upon. Now, I do not know the details of all of this and I certainly hope to study the matter more, yet, as someone who is not a linguist as such but an inventor and programmer, I have a concern that using one code point for two types of meaning, rather than one code point for each type of meaning, is what I call a software unicorn. The concept of a software unicorn can be read about on http://www.users.globalnet.co.uk/~ngo/euto0008.htm if anyone is interested.

I gather from the article that a "software unicorn" is an unlikely, perhaps impossible, situation that nevertheless must be handled because it cannot be completely ruled out. Lots of defensive code gets written to handle such situations, often with a comment like:

    default:    // this can't happen, but...
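The defensive pattern Doug describes can be made concrete. Below is a minimal Python sketch, invented purely for illustration (the function and its classification are not from the thread): the final branch "can't happen" for input that has already been validated as a Unicode scalar value, but the code guards against it anyway -- a software unicorn.

```python
def area_of(cp: int) -> str:
    """Classify a code point by the broad Unicode area it falls in."""
    if 0xE000 <= cp <= 0xF8FF:
        return "private use"      # the BMP Private Use Area
    if 0 <= cp <= 0x10FFFF:
        return "regular"
    # default: this can't happen for validated input, but...
    raise ValueError(f"not a Unicode code point: {cp:#x}")
```

The `raise` is the unicorn handler: callers should never trigger it, yet leaving it out would silently misclassify garbage input.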
In this context, I think William is saying that it's risky to overload ZWJ to handle Latin ligation, because we can't completely rule out the possibility that we might need ZWJ to join Latin characters the way it currently joins Arabic characters. This concern can probably be put to rest by reading the description in Section 13.2 of UAX #27, the Unicode 3.1 technical report. The description carefully spells out the relationship between cursively connected and ligated renditions, and the roles ZWJ and ZWNJ play in determining which rendition is to be used.

> As to strong opposition to encoding additional presentation forms for alphabetic characters, well, we live in a democratic society, and if some people who would like to produce quality printing feel that using a TrueType fount with some ligature characters does what they want and harms no one else, what exactly is the objection?

Ah, but it *isn't* harmless. It causes problems for normalization. For homework tonight, read UAX #15, Unicode Normalization Forms. The key point for our discussion is that
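The normalization problem can already be seen with the Latin ligature presentation forms that exist today. A small check with Python's standard `unicodedata` module (a sketch; it assumes any modern Python, whose character database includes these characters):

```python
import unicodedata

# U+FB01 LATIN SMALL LIGATURE FI is a compatibility character;
# NFKC folds it back into the two letters "f" + "i".
assert unicodedata.normalize("NFKC", "\uFB01") == "fi"

# The decomposition is a compatibility one, not canonical, so NFC
# leaves the ligature character alone -- two inequivalent spellings
# of the same word survive under the canonical forms.
assert unicodedata.normalize("NFC", "\uFB01") == "\uFB01"
```

Any newly encoded ligature character would face the same dilemma: carry a compatibility decomposition and be folded away by NFKC, or carry none and leave two spellings of the same text that no normalization form can reconcile.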
Courtyard Codes and the Private Use Area (derives from Re: Encoding of symbols and a lock/unlock pre-proposal)
Peter Constable included the following in his post:

> As for PUA, many people have their own plans regarding U+F300..U+F3FF. For my own part, my plans for U+F300..U+F3FF almost certainly do not involve padlock symbols.

Thank you for your email. As is well known, the Unicode Consortium will not endorse any code point allocations in the Private Use Area, and everyone has the right to allocate none, some or all code points in the Private Use Area as he or she chooses, and to publish them if he or she so chooses.

This is an interesting situation. If one views the situation from the inside looking out, then it is impossible to be certain of the intended meaning of a Private Use Area code point used in a Unicode plain text file simply by examining the code points. However, if one views the situation from the outside looking in, a somewhat different situation arises.

Suppose that I define a .eut file format to be, structurally, a Unicode plain text file with the added feature that all code points within the Unicode Private Use Area are defined to have the meanings which I give them in my eutocode set of code point allocations. A .eut file would then be a rigorously defined file format, just as .bmp or .png is. If a wordprocessing package had a selection option for reading in files of .eut format, then there would be no confusion whatsoever about the meaning of, say, a U+E707 character: it would be a ct ligature.

Now, suppose I define a .uto file format to be, structurally, a Unicode plain text file with the added feature that all code points within the U+F3.. block of the Private Use Area have the meanings of a set of codes called Courtyard Codes, and all other Private Use Area code points have an undefined meaning, unless a sequence of Courtyard Codes has indicated from which "type tray" all subsequent Private Use Area codes outside the U+F3.. block are to be regarded as coming. A wordprocessing package could be programmed by its manufacturer to accept input in .uto file format, with accuracy of meaning for every code point used in the file, even if some Private Use Area code points were used with two different meanings in two parts of the same document.

I like to imagine an analogy in which plane 0 is a large kitchen table. Onto most parts of the table, pieces of coloured paper are laid, always taking care that no piece of paper overlaps any other, so that the table surface is covered by only one thickness of paper. One area, about one tenth of the table, is called the Private Use Area, and here paper can be piled; perhaps 500 sheets of paper could be piled upon it. So, if someone asks of some particular place on the table, "What colour is the paper?", then for parts of the table not in the Private Use Area the colour of the paper can be stated. For the Private Use Area, however, the colour cannot be stated with certainty: it depends upon which piece of paper is being viewed at any one time.

Suppose, however, that the people placing paper onto the Private Use Area agree amongst themselves that they like the look of a nice yellow square of paper that takes up a small part of the Private Use Area, and will voluntarily avoid placing any paper on top of it. One would then end up with a Private Use Area that has coloured paper piled up all over it, except in one small area where there is a yellow square. The net effect would be that the area covered by the yellow square would be as uniquely defined, as to the colour of paper upon it, as anywhere outside the Private Use Area.

Now, the question that naturally arises is: will all end users agree to keep the U+F3.. area only for the Courtyard Codes? Who knows? I suggest, however, that it is possible that they will, because I hope that, when they consider the matter, people will feel that it is to their own advantage to do so. I feel that if everybody who wishes to make definitions in the Private Use Area learned of the existence of the Courtyard Codes, and found that the features they could provide are extremely useful and may, in time, become built into widely used software packages, then they might well do so.

What would this take? Ease of use. Wherever a wordprocessing package or a desktop publishing package or whatever has an option for reading in a Unicode plain text file, it would also have an option for reading in a .uto file. The Courtyard Codes would need to be well defined, publicly available, free to use and free of legal entanglements.

Please note that I have chosen the name Courtyard Codes for the system as Courtyard and Codes are two English words, not words specially coined. I got the idea of using the word Courtyard
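Mechanically, the .eut/.uto idea amounts to dispatching Private Use Area code points through a privately published mapping. A minimal Python sketch (the `EUTOCODE_MAP` table is invented for illustration; its U+E707 = "ct ligature" entry is the example given in the message above, not any registered assignment):

```python
import unicodedata

# Hypothetical private mapping, as a ".eut" reader might define it.
EUTOCODE_MAP = {0xE707: "ct ligature"}

def describe(ch: str) -> str:
    """Return a human-readable meaning for a single character."""
    cp = ord(ch)
    if 0xE000 <= cp <= 0xF8FF:  # the BMP Private Use Area
        return EUTOCODE_MAP.get(cp, "private use (meaning undefined)")
    return unicodedata.name(ch, "unassigned")
```

Under this private convention `describe("\uE707")` yields "ct ligature", while characters outside the PUA fall back to their standard Unicode names. The point of the proposal is that such a table only resolves ambiguity for software that has agreed, out of band, to use it.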
Please help: problem with Netscape 6.x
Please reply directly to [EMAIL PROTECTED]. Thanks. Magda

-----Original Message-----
Date/Time: Fri May 24 04:02:24 EDT 2002
Contact: [EMAIL PROTECTED]
Report Type: General question

Text of the report is appended below:

I'm developing an internet application for users to input multilingual data. It works fine using Internet Explorer 5.x and above. However, it does not work in Netscape, even version 6.x.

1. All my web pages set the charset to UTF-8:

    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    ...
    </head>

2. My database is set to accept UTF-8 or Unicode characters.

3. My browser encoding selects UTF-8. IE 5.x: View > Encoding > Auto-Select, or Unicode (UTF-8). Netscape 6.x: View > Character Coding > Auto-Detect > Auto-Detect (All), or Unicode (UTF-8).

Problem description (in Netscape): When I retrieved data from my database (my non-English data is stored in a form such as &#x3B1;) and displayed it in a page, it displayed correctly. I proceeded to edit some fields, leaving other fields as they were (even for multilingual data). When I posted the data and navigated to the next page, the non-English fields appeared as '?'.

I've spent days trying to figure out the problem, but to no avail. I'm really at my wit's end and really need any help you can offer. I got to know your email address while surfing the net for a solution and saw your articles on the net. Hope that you really can help.

Thanks, Geok Hu

(End of Report)
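Without seeing the application one can only guess, but the '?' symptom usually means the posted form data was converted to a legacy charset somewhere in the browser/server round trip; one era-appropriate mitigation is HTML 4's `accept-charset="utf-8"` attribute on the `<form>` element. The numeric character references the database stores are at least lossless in themselves, as a quick Python check shows (illustrative only):

```python
import html

# Non-English data stored as numeric character references such as
# "&#x3B1;" decodes losslessly, so the stored form is not the problem;
# suspect the charset of the POST instead.
decoded = html.unescape("&#x3B1;&#x3B2;&#x3B3;")
assert decoded == "\u03b1\u03b2\u03b3"  # Greek alpha, beta, gamma
```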
Re: Courtyard Codes and the Private Use Area (derives from Re: Encoding of symbols and a lock/unlock pre-proposal)
WO U+F3A2 PLEASE LIGATE THE NEXT TWO CHARACTERS
WO U+F3A3 PLEASE LIGATE THE NEXT THREE CHARACTERS
WO U+F3A4 PLEASE LIGATE THE NEXT FOUR CHARACTERS

While I don't think this discussion of various PUA allocations should continue much further, it would probably be a lot better to introduce the already-discussed ZERO WIDTH LIGATOR in such a form that X ZWL Y produces the XY ligature, X ZWL Y ZWL Z the XYZ ligature, and so on. That saves you a lot of hassle with longer ligatures.

WO U+F3A8 PLEASE SWASH THE NEXT PRINTABLE ITEM
WO U+F3A9 PLEASE ALTERNATIVE SWASH THE NEXT PRINTABLE ITEM

Does this belong in a character-based encoding system at all? This is better solved by markup. If you are defining your own file formats anyway, do include some sensible markup system there, and you won't have to clutter the PUA and restrict its use. What if you've got more than two swash forms, by the way?

WO U+F3C0 PLAIN - ITALIC:=false; BOLD:=false;
WO ...
WO U+F3FF 192 POINT

Again, markup is the better solution. And, to be honest, it's a bit of a waste of space on the mailing list, don't you think?

WO I hope that these Courtyard Codes will be of interest to end users.

I don't really think so. They don't offer very much that well-known typesetting systems don't already implement in their own fashion.

Philipp  mailto:[EMAIL PROTECTED]
___ Stay the patient course / Of little worth is your ire / The network is down
Re: Courtyard Codes and the Private Use Area (derives from Re: Encoding of symbols and a lock/unlock pre-proposal)
William,

Your Courtyard codes are a form of formatting markup. Why not use XML? Everyone else does.
--
Michael Everson *** Everson Typography *** http://www.evertype.com
Contributions from meeting 42 - Charts, Resolutions and latest document register
Some new WG2 documents:

N2492  http://www.dkuug.dk/jtc1/sc2/wg2/docs/n2492.pdf  Charts - 10646-2 AMD1 - Freytag - 2002-05-22
N2491  http://www.dkuug.dk/jtc1/sc2/wg2/docs/n2491.pdf  Charts - 10646-1 AMD2 - Freytag - 2002-05-22
N2454  http://www.dkuug.dk/jtc1/sc2/wg2/docs/n2454.doc  Dublin Meeting 42 Resolutions - Ksar - 2002-05-23
N2450  http://www.dkuug.dk/jtc1/sc2/wg2/docs/n2450.htm  Partial document register - 2190-2472 in reverse - Ksar - 2002-05-23

Mike Ksar
--
Michael Everson *** Everson Typography *** http://www.evertype.com
Adding characters with decompositions (was Re: Private Use proposals)
-----BEGIN PGP SIGNED MESSAGE-----

Doug Ewell wrote:

> [...] Beyond a certain point in time (defined as Unicode 3.1), no new canonical or compatibility equivalences can be defined.

Huh? What about the compatibility ideographs U+FA30..FA6A, added in 3.2? Or U+2047 DOUBLE QUESTION MARK, also added in 3.2?

A correct statement of the policy is that no newly assigned character can *canonically* decompose to *two characters* unless it is added to the composition exclusion list.

- --
David Hopwood [EMAIL PROTECTED]
Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5 0F 69 8C D4 FA 66 15 01
Nothing in this message is intended to be legally binding. If I revoke a public key but refuse to specify why, it is because the private key has been seized under the Regulation of Investigatory Powers Act; see www.fipr.org/rip
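David's counterexamples, and his corrected statement of the policy, can be checked against the character database with Python's standard `unicodedata` module (a sketch; it assumes a Python whose Unicode data is version 3.2 or later):

```python
import unicodedata

# U+FA30, one of the CJK compatibility ideographs added in Unicode 3.2,
# has a *singleton* canonical decomposition to U+4FAE; singletons are
# permitted under the policy because they can never compose back.
assert unicodedata.normalize("NFC", "\uFA30") == "\u4FAE"

# U+2047 DOUBLE QUESTION MARK (also new in 3.2) decomposes to "??",
# but only under *compatibility* normalization, which the policy
# likewise permits for new characters.
assert unicodedata.normalize("NFKC", "\u2047") == "??"
```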
Re: Courtyard Codes and the Private Use Area (derives from Re: Encoding of symbols and a lock/unlock pre-proposal)
On Friday, May 24, 2002, at 08:06 AM, Philipp Reichmuth wrote:

> WO U+F3A2 PLEASE LIGATE THE NEXT TWO CHARACTERS
> WO U+F3A3 PLEASE LIGATE THE NEXT THREE CHARACTERS
> WO U+F3A4 PLEASE LIGATE THE NEXT FOUR CHARACTERS
>
> While I don't think this discussion of various PUA allocations should continue very further, it's probably a lot better to introduce the already-discussed ZERO WIDTH LIGATOR in such a form that X ZWL Y produces the XY ligature, X ZWL Y ZWL Z the XYZ ligature and so on. It saves you a lot of hassle with longer ligatures.

Zero width ligator was rejected. Zero-width joiner can be used to mark ligation points where they are absolutely necessary; where they are merely stylistic preferences, they belong in markup.

==
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jenkins/
Re: Courtyard Codes and the Private Use Area (derives from Re: Encoding of symbols and a lock/unlock pre-proposal)
From: John H. Jenkins [EMAIL PROTECTED]
Sent: Friday, May 24, 2002 1:54 PM

> On Friday, May 24, 2002, at 08:06 AM, Philipp Reichmuth wrote:
>
>> WO U+F3A2 PLEASE LIGATE THE NEXT TWO CHARACTERS
>> WO U+F3A3 PLEASE LIGATE THE NEXT THREE CHARACTERS
>> WO U+F3A4 PLEASE LIGATE THE NEXT FOUR CHARACTERS
>>
>> While I don't think this discussion of various PUA allocations should continue very further, it's probably a lot better to introduce the already-discussed ZERO WIDTH LIGATOR in such a form that X ZWL Y produces the XY ligature, X ZWL Y ZWL Z the XYZ ligature and so on. It saves you a lot of hassle with longer ligatures.
>
> Zero width ligator was rejected. Zero-width joiner can be used to mark ligation points where they are absolutely necessary; where they are merely stylistic preferences, they belong in markup.

But with that said, I have to agree with Philipp -- the PUA discussion really needs to end. William, please start thinking of the PUA as the city dump. Everyone is glad it is there when you have to stick something somewhere, but no one really talks about it much and no one *ever* wants to take things out of it and strew them over their nice, clean characters. :-)

MichKa

Michael Kaplan
Trigeminal Software, Inc. -- http://www.trigeminal.com/
Language name questions
Hi,

I am trying to determine the names of a few languages in their own language. This is for a list of language names that a user can select, like:

English
Français
日本語

and so on. I need answers to some particular questions, but if someone could point me at a book or web site, then that would be even better. Here are the languages I'm trying to pin down:

Hungarian: magyar or magyarul?
Slovak: Slovenský?
Slovenian: Slovenski? Slovensko?

Any help would be gratefully appreciated!

Deborah Goldsmith
Manager, Fonts & Unicode
Apple Computer, Inc.
[EMAIL PROTECTED]
Re: Language name questions
For the ICU data for those, look at:

http://oss.software.ibm.com/cgi-bin/icu/lx/en_US/utf-8/?_=hu
http://oss.software.ibm.com/cgi-bin/icu/lx/en_US/utf-8/?_=sk
http://oss.software.ibm.com/cgi-bin/icu/lx/en_US/utf-8/?_=sl

(Note: there are sometimes oddities with the server; 'refresh' if you get a blank screen.)

Mark
__
http://www.macchiato.com
"Eppur si muove"

----- Original Message -----
From: Deborah Goldsmith [EMAIL PROTECTED]
To: Unicode List [EMAIL PROTECTED]
Sent: Friday, May 24, 2002 16:04
Subject: Language name questions
Re: Language name questions
On Friday, May 24, 2002, at 05:43 PM, Mark Davis wrote:

> http://oss.software.ibm.com/cgi-bin/icu/lx/en_US/utf-8/?_=sk

This has Slovenčina, but we've also seen Slovenský.

Deborah
Re: Language name questions
I'll forward it to our localization people, and see what they say.

Mark
__
http://www.macchiato.com
"Eppur si muove"

----- Original Message -----
From: Deborah Goldsmith [EMAIL PROTECTED]
To: Mark Davis [EMAIL PROTECTED]
Cc: Unicode List [EMAIL PROTECTED]
Sent: Friday, May 24, 2002 18:26
Subject: Re: Language name questions

> On Friday, May 24, 2002, at 05:43 PM, Mark Davis wrote:
>> http://oss.software.ibm.com/cgi-bin/icu/lx/en_US/utf-8/?_=sk
>
> This has Slovenčina, but we've also seen Slovenský.
>
> Deborah
Re: Courtyard Codes and the Private Use Area
Michael (michka) Kaplan (michka at trigeminal dot com) wrote:

> William, please start thinking of the PUA as the city dump. Everyone is glad it is there when you have to stick something somewhere, but no one really talks about it much and no one *ever* wants to take things out of it and strew it on their nice, clean characters.

Oh, that's going a bit far. I think it's more like an attic or basement: a handy place to store your old baseball card collection and other personal items that don't belong anywhere else, but definitely not a place you'd want to invite the neighbors into or make the center of attention of your house.

-Doug Ewell
Fullerton, California
Re: Language name questions
Deborah Goldsmith (goldsmit at apple dot com) wrote:

> Here are the languages I'm trying to pin down:
> Hungarian: magyar or magyarul?
> Slovak: Slovenský?
> Slovenian: Slovenski? Slovensko?

FWIW, the language menu in my Ericsson T28 World phone offers "Magyar" (Hungarian), "Slovenčina" (Slovak), and "Slovenski" (Slovenian).

-Doug Ewell
Fullerton, California
N2476 a hoax?
A new JTC1/SC2/WG2 document, ostensibly from the Unicode Technical Committee, was posted on the WG2 web site this past week. The URL is:

http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2476.pdf

This document is so far removed from the stated position of the UTC, and so far below its normal editorial standards, that I believe it was submitted by some other organization that signed the UTC's name to it as a hoax, perhaps in an attempt to lend it credibility. I'm not normally much of a conspiracy theorist, so I'm admittedly stepping into unfamiliar territory here.

The document, N2476, is titled "Variants and CJK Unified Ideographs in ISO/IEC 10646-1 and -2," which is quite a broad category. It turns out to be about inventing some sort of equivalence classes among Han characters so that they can be considered the same in certain contexts. Anyone can create this type of equivalence class for their own personal use, of course, but N2476 proposes that the IRG be instructed to develop a classification scheme that would have some sense of being officially sanctioned. This is at odds with what I have heard from the most prominent CJK experts on this list: that such equivalences are too dependent on context and writer's intent to belong in a character encoding standard.

For starters, the paper is signed "Unicode Techncial Committee." Under what circumstances would any member of the UTC release a paper with the UTC's own name misspelled? There are other editorial mishaps:

> ... end-users may want text to be treated is equivalent ...
> There are situations were some users ...

which are not at all up to the usual standards of a UTC document. But enough nitpicking; it's the content that really makes me think this document is a spoof.
Check this justification for creating an equivalence class between simplified and traditional Han characters:

> To give one instance which has been of some importance in early 2002, most users want simplified and traditional Chinese to be the same in internationalized domain names.

"Most users" is both overstated and unsubstantiated. Several representatives from the Chinese, Taiwanese, and Hong Kong domain-name industry made this claim on the Internationalized Domain Name (IDN) mailing list. The topic became known simply as "TC/SC" and, for over a month, was more frequently and persistently discussed than any other topic. It got to the point where the domain-name representatives organized a chain-letter campaign, resulting in over 300 messages -- many identically worded, and from previously silent contributors -- insisting that the IDN architecture must implement TC/SC equivalence or be a complete failure. Latin domain names, after all, are case-insensitive: Www.Unicode.Org resolves to the same address as www.unicode.org. UTC members have repeatedly stated that TC/SC equivalence is not at all comparable to Latin case mapping.

> The inability to provide for [TC/SC equivalence] very nearly prevented Chinese from being used in internationalized domain names.

No, it didn't. That was a counterproposal made by the Chinese domain-name representatives, who claimed that prohibiting Han characters for now would give the relevant bodies more time to develop a proper TC/SC mapping solution (implying that the problem was solvable at all, an opinion disputed by many).

> Programmers and users are being increasingly frustrated that as ISO/IEC 10646 becomes more pervasive, they are increasingly compelled to deal with a large number of variant characters some of which are only subtly different from each other and which cannot be automatically equated.
The UTC would never refer to ISO/IEC 10646 as "pervasive" or talk of programmers and users being "compelled" to deal with variant characters, nor would it make such an emotional appeal that such variants should be "automatically equated." Note the lack of standard UTC/WG2 terminology; if this were the UTC talking, you would be reading about canonical and compatibility equivalents and normalization. This passage also hints at the author's lack of awareness that similar equivalence issues exist for scripts other than Han.

> It is vitally important that data be provided to allow developers, protocols, and other standards to deal with Han variants.

I have never before seen an official UTC paper that claimed it was "vitally important" to solve a given problem. Individual submissions, yes.

> What is needed, however, is something that allows at the least for a first-order approximation of equivalence; it would be up to the authors of the individual application, protocol, or standard to determine whether this were acceptable or not.

And what if the authors decide the IRG-developed approach is not acceptable? What are they expected to do then? Again, the reader is invited to contrast this passage, in both form and content, with any other that has been issued from the UTC in the past.

On the very same day (2002-05-08) that N2476 was published,