RE: more flexible pipeline for new scripts and characters
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of Doug Ewell

> This is one of the things the PUA is for. Unfortunately, it has become very popular to tell people to stay away from the PUA, that it is evil and unsuitable for any sort of interchange

That's an overstatement. Implementing PUA can be very problematic _in some scenarios_. For instance, suppose an OS vendor were to implement PUA characters for thousands of ideographs, in effect assuming that a large portion of the PUA were those ideographs. That would lead to a number of problems, including the following:

- users wanting to use a PUA character for some other purpose would run into conflicts
- data would not interoperate between that OS and other platforms
- if those characters are later added to Unicode, users, app developers and the OS vendor have to deal with the problems of data using alternate representations

At the opposite extreme, suppose an individual user or app developer needs to represent something as a PUA character and there is no broad interchange of data using that PUA character; then none of the problems mentioned above arise.

Of course, in between there is a range of scenarios involving varying degrees of data interchange, and the risks will vary depending on the scenario. Anyone considering use of the PUA should evaluate the potential risks and costs of their options.

Peter
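[For readers who want to see Peter's "no standard meaning" point concretely, here is a small sketch using Python's stdlib unicodedata module; any PUA code point behaves the same way.]

```python
import unicodedata

# U+E000..U+F8FF is the BMP Private Use Area. PUA code points are valid
# Unicode characters, but the standard assigns them no name and only
# default properties; their meaning exists purely by private agreement.
pua_char = "\uE000"

# General_Category is "Co" (Other, Private Use) for every PUA code point.
print(unicodedata.category(pua_char))  # Co

# There is no standard character name for unicodedata.name() to return.
try:
    unicodedata.name(pua_char)
except ValueError:
    print("no standard name")  # this branch is taken
```

Two cooperating parties can still exchange PUA text, but everything beyond the code point itself (name, properties, expected rendering) has to travel by out-of-band agreement, which is exactly why broad interchange is risky.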
Re: more flexible pipeline for new scripts and characters
I agree completely: there are some situations, or combinations of situations, where using the PUA and interchanging PUA data will certainly cause problems, and some where it will not, and in fact can be quite helpful. However, the explanation Peter laid out is often not stated publicly, but boiled down to simplistic edicts like "avoid the PUA." Many examples can be found, on this list and elsewhere.

--
Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14
www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell

-----Original Message-----
From: Peter Constable
Sent: Sunday, November 20, 2011 12:28
To: unicode@unicode.org
Subject: RE: more flexible pipeline for new scripts and characters
Re: more flexible pipeline for new scripts and characters
On 11/16/2011 07:25 AM, Asmus Freytag wrote:
> Peter, in principle, the idea of a provisional status is a useful concept whenever one wants to publish something based on potentially doubtful or possibly incomplete information. And you are correct that, in principle, such an approach could be most useful whenever there's no possibility of correcting some decision taken in standardization. Unicode knows the concept of a provisional property, which works roughly in the manner you suggested.
>
> However, for certain types of information to be standardized, in particular the code allocation and character names, it would be rather problematic to have an extended provisional status. The reason is that once something is exposed in an implementation, it enables users to create documents. These documents would all have to be provisional, because they would become obsolete once a final (corrected or improved) code allocation were made. The whole reason that some aspects of character encoding are write-once (can never be changed) is to prevent such obsolete data in documents. Therefore, the only practical way is that of having a bright line between proposed allocations (that are not implemented and are under discussion) and final, published allocations that anyone may use.
>
> Instead of a provisional status, the answer would seem to lie in making the details of proposed allocations more accessible for review during the period where they are under consideration and balloting in the standardization committee. One possible way to do that would be to make repertoire additions subject to the Public Review process. Another would be for more interested people to become members and to follow submissions as soon as they hit the Unicode document registry. The former is much more labor-intensive, and I suspect not something the Consortium could easily manage with its existing funding and resources. The latter would have the incidental benefit of providing some additional funding for the work of the Consortium via membership fees.
>
> A./

How is this different from Named sequences, which are published provisionally?
Re: more flexible pipeline for new scripts and characters
On 11/18/2011 1:30 PM, Karl Williamson wrote:
> How is this different from Named sequences, which are published provisionally?

Named sequences aren't character properties. When a newly encoded character is published in the standard, its code point, its name, and dozens of other properties all have to be published at the same time. The whole notion of omitting any of them would cause problems for implementers and would be tantamount to saying that the character isn't actually standard yet, because properties for it are missing.

And for good reasons, *some* (but not all) of those properties are also immutable upon publication. The most obvious is the code point, of course. Changing a code point for an encoded character after it is published in the standard is tantamount to admitting it was never standard in the first place. In the early days of Unicode (and 10646, for that matter), the committees entertained the notion that character names might be the kind of thing which could occasionally get corrected later, as needed, after publication. But after several notorious examples of the undesirability and costs associated with changing character names after publication, the committees slammed the door on that, and character *names* are now as immutable as their code points.

Named sequences are different. Publishing a newly encoded character has no implications whatsoever for named sequences. A named sequence stands on its own, as an independent entity. Furthermore, there basically are no algorithms (or implementations) that depend on them in any significant way. Named sequences are primarily epicycles of the character encoding process -- they give standard names to things that people want to have names for, but which the committees decline to encode as characters, because they can already be represented by sequences of existing characters.

Given that status, and given that named sequences are *not* character properties, it was possible to create a two-stage, provisional publication mechanism for them: publishing them first as a provisional list, and then later, if nobody has any objections or corrections, moving them into the (immutable) standard list. You just can't do that with character *names*.

If you want to make analogies, however, the ISO ballots constitute the *provisional* publication for character code points and names. If nobody has any objections or corrections expressed during the balloting process (which can continue for 2 years or longer), then eventually those code points and names get moved into the (immutable) list in the standard.

--Ken
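[Ken's point about name immutability is visible in the published data: U+FE18, mentioned elsewhere in this thread, kept its misspelled name and instead received a corrected *formal alias*. A short sketch; Python's stdlib unicodedata.lookup() resolves formal aliases since Python 3.3.]

```python
import unicodedata

# U+FE18 was published with a misspelled name ("BRAKCET"). Names are
# immutable, so the standard added a corrected formal alias rather than
# changing the name itself.
c = unicodedata.lookup(
    "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET")
print(f"U+{ord(c):04X}")  # U+FE18

# lookup() also resolves the corrected formal alias to the same code point...
alias = unicodedata.lookup(
    "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET")
print(alias == c)  # True

# ...but name() still returns the original, immutable (misspelled) name.
print(unicodedata.name(c).endswith("BRAKCET"))  # True
```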
Re: more flexible pipeline for new scripts and characters
On 11/18/2011 1:30 PM, Karl Williamson wrote:
> On 11/16/2011 07:25 AM, Asmus Freytag wrote:
>> The whole reason that some aspects of character encoding are write-once (can never be changed) is to prevent such obsolete data in documents.
> How is this different from Named sequences, which are published provisionally?

Named sequences are a special case. The sequence as such exists whether or not a name is defined for it. Therefore, ordinary users can go about their business creating documents containing character sequences without needing to know whether a sequence is named or not. Those users (programmers) who use these names in place of identifiers can be expected to understand what "provisional" means and to be aware of the penalties for implementing them in ways that can't later be upgraded. Perl should probably not support them in regex notation, for example.

So, in all respects, these act more like ordinary properties, for which provisional information is already supported in the UCD (mostly for Unihan).

A./
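[Asmus's Perl remark has a direct Python parallel: a named sequence resolves to a string of several code points, so it cannot serve as a single-character escape in a regex, though it can still appear in a pattern as an ordinary literal. A sketch using the stdlib; unicodedata.lookup() accepts named sequences since Python 3.3, and the regex \N{...} escape exists since Python 3.8.]

```python
import re
import unicodedata

# A named sequence (here, an entry from NamedSequences.txt) is a *string*
# of code points, not a single character:
seq = unicodedata.lookup("TAMIL CONSONANT K")
print([f"U+{ord(ch):04X}" for ch in seq])  # ['U+0B95', 'U+0BCD']

# The regex \N{...} escape denotes exactly one character, so it can carry
# a character name but not a named-sequence name:
assert re.search(r"\N{LATIN SMALL LETTER A}", "a")

# The sequence itself can still be matched as a plain literal:
assert re.search(re.escape(seq), "\u0B95\u0BCD")
```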
Re: more flexible pipeline for new scripts and characters
On 11/18/2011 3:06 PM, Ken Whistler wrote:
> On 11/18/2011 1:30 PM, Karl Williamson wrote:
>> How is this different from Named sequences, which are published provisionally?
> Named sequences aren't character properties.

They provide information about characters in context - in that sense they are similar to many other properties, even if most of them can be mapped to single character codes (with the contextual behavior left to algorithms and rules).

That is not to detract from your main point, with which I fully agree: this puts them into the realm of information that is not required for a character to be minimally defined, and that does not need to be available from day one for a character to be implementable at all (unlike decomposition mappings, bidi class, code point, name, etc.).

> If you want to make analogies, however, the ISO ballots constitute the *provisional* publication for character code points and names. If nobody has any objections or corrections expressed during the balloting process (which can continue for 2 years or longer), then eventually those code points and names get moved into the (immutable) list in the standard.

Good point. If it were manageable, I would recommend that Unicode have a public review process of its own for character proposals, so as to elicit broader public review before data is finalized for publication. In the Unicode process, there's a public beta, but that is useful only to spot mistakes in the publishing process - it's usually too late to fix substantial mistakes of any kind.

A./
Re: more flexible pipeline for new scripts and characters
Peter, in principle, the idea of a provisional status is a useful concept whenever one wants to publish something based on potentially doubtful or possibly incomplete information. And you are correct that, in principle, such an approach could be most useful whenever there's no possibility of correcting some decision taken in standardization. Unicode knows the concept of a provisional property, which works roughly in the manner you suggested.

However, for certain types of information to be standardized, in particular the code allocation and character names, it would be rather problematic to have an extended provisional status. The reason is that once something is exposed in an implementation, it enables users to create documents. These documents would all have to be provisional, because they would become obsolete once a final (corrected or improved) code allocation were made. The whole reason that some aspects of character encoding are write-once (can never be changed) is to prevent such obsolete data in documents. Therefore, the only practical way is that of having a bright line between proposed allocations (that are not implemented and are under discussion) and final, published allocations that anyone may use.

Instead of a provisional status, the answer would seem to lie in making the details of proposed allocations more accessible for review during the period where they are under consideration and balloting in the standardization committee. One possible way to do that would be to make repertoire additions subject to the Public Review process. Another would be for more interested people to become members and to follow submissions as soon as they hit the Unicode document registry. The former is much more labor-intensive, and I suspect not something the Consortium could easily manage with its existing funding and resources. The latter would have the incidental benefit of providing some additional funding for the work of the Consortium via membership fees.

A./
Re: more flexible pipeline for new scripts and characters
I guess what I'm proposing is that the proposed allocations be implemented, so that problems may be unearthed, even as the users accept that the standard is still only provisional.

On Wed, Nov 16, 2011 at 3:25 PM, Asmus Freytag asm...@ix.netcom.com wrote:
> Instead of a provisional status, the answer would seem to lie in making the details of proposed allocations more accessible for review during the period where they are under consideration and balloting in the standardization committee.
Re: more flexible pipeline for new scripts and characters
On 11/16/2011 6:37 AM, Peter Cyrus wrote:
> I guess what I'm proposing is that the proposed allocations be implemented, so that problems may be unearthed, even as the users accept that the standard is still only provisional.

Where users are programmers, as is the case with certain properties, such niceties are more or less understood by all parties involved. Where users are the public, as would be the case with provisional implementations, you run into more issues. Not many users are in the business of creating test data that can be thrown away. Most expect any implementation to be faithful (forever) to their data.

Second, absent a firm timeline in standardization (which prevents bad proposals from being held back indefinitely), implementers would not know when they could move their provisional implementations to final status for a given script. Most implementations support more than a single script, which would mix provisional and non-provisional data.

Test implementations can be built any time, and whether you base them on draft documents under ballot or on provisional allocations under some more formal scheme really makes no difference. (There's been a long-standing suggestion that people test characters or scripts using the Private Use Area. This seems not to be favored, again, because all data created under such a scheme becomes obsolete once a final encoding comes out.)

What would make a difference would be the ability to have some scripts exist in a provisional state for really extended periods, to allow all sorts of issues to be discovered in realistic use. That, however, runs into the problem that users tend to be impatient: once functional implementations exist, they want to create real data.

So far, for the vast majority of characters, the existing system has proven workable. There are a small number of mistakes that are discovered too late to be fixed invisibly, leaving a trail of deprecated characters or formal aliases for character names. Overall, the number of these is rather small, given the sheer size of Unicode, even if one or another recent example appears to warrant more systematic action.

A./
RE: more flexible pipeline for new scripts and characters
Peter Cyrus pcyrus at alivox dot net wrote:

> In other words, people could propose a new script or character and rather than have it discussed before encoding and then encoded in permanence, with no possibility even to correct obvious errors as in U+FE18, instead it would be provisionally accepted but still subject to modifications as implementors worked with it. Hopefully, most mistakes would be unearthed early and corrections applied before much text had been encoded. As time passed and the encoding became more stable, the size of mistake open to correction would be reduced, e.g. to spelling errors, until it was frozen as a result of this process before being declared permanent.

As Asmus points out, users tend to want to jump the gun and start using anything that appears to be even provisionally approved. Look at all the health warnings that UTC has to include on the Pipeline and beta-review pages.

> My thought is that some of the problems that I've seen discussed might have been discovered and addressed had a community been using the proposed standard before it became immutable. In the current process, that transition may occur too early to be useful. It may be easier to fix all the existing text if very little time has passed, than to fix all future text forever.

Spelling errors like BRAKCET and name errors like LATIN LETTER OI really don't tend to matter much in the real world. We do talk about them a lot.

> This idea could also be extended to new characters and scripts that might or might not make it into Unicode: Unicode could offer a provisional acceptance that allowed users to demonstrate the utility of the proposed changes once they're in Unicode, even if they're later modified or withdrawn.

This is one of the things the PUA is for. Unfortunately, it has become very popular to tell people to stay away from the PUA, that it is evil and unsuitable for any sort of interchange, and so people tend to look for alternative solutions which shouldn't be necessary.

> This policy might have prevented the recoding of Tengwar, Cirth, Shavian, Phaistos Disc and Deseret as they moved from the PUA to the SMP.

The PUA is a kind of sandbox for encoding experimentation. For exactly the reasons you give elsewhere, there was no guarantee that Shavian and Phaistos Disc and Deseret would be encoded exactly as they were found in the ConScript Unicode Registry -- indeed, the layout of the Deseret block was different. They would have had to be recoded anyway. The same is true for Tengwar and Cirth (which, by the way, have not been approved or even reconsidered recently).

--
Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14
www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell
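[Doug's recoding point can be sketched in a few lines of Python: once a provisionally PUA-encoded script receives its final allocation, every existing document has to be migrated. The PUA slots below are purely illustrative, not the actual ConScript Unicode Registry assignments; only the targets, the first two letters of the standardized Shavian block, are real.]

```python
# Hypothetical PUA -> final-allocation table. The PUA code points here
# are made up for illustration; the Shavian targets are the real ones.
PUA_TO_FINAL = {
    0xE700: 0x10450,  # -> U+10450 SHAVIAN LETTER PEEP
    0xE701: 0x10451,  # -> U+10451 SHAVIAN LETTER TOT
}

def migrate(text: str) -> str:
    """Recode provisional PUA data to the final standard allocation."""
    return text.translate(PUA_TO_FINAL)

migrated = migrate("\uE700\uE701")
print([f"U+{ord(ch):04X}" for ch in migrated])  # ['U+10450', 'U+10451']
```

The one-liner hides the real cost: every document, font, and input method that committed to the provisional code points needs an equivalent pass, and any data that escapes migration is obsolete forever.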