RE: more flexible pipeline for new scripts and characters

2011-11-20 Thread Peter Constable
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of Doug Ewell

 This is one of the things the PUA is for.  Unfortunately, it has become 
 very popular to tell people to stay away from the PUA, that it is evil 
 and unsuitable for any sort of interchange

That's an overstatement. Implementing PUA can be very problematic _in some 
scenarios_. For instance, suppose an OS vendor were to implement PUA characters 
for thousands of ideographs, in effect assuming that a large portion of PUA 
were those ideographs. That would lead to a number of problems, including the 
following:

- users interested in using a PUA character for some other purpose would have 
problems

- data would not interoperate between that OS and other platforms

- if those characters are later added to Unicode, users, app developers
and the OS vendor have to deal with the problems of data using alternate
representations

At the opposite extreme, suppose an individual user or app developer
needs to represent something as a PUA character and there is no broad
interchange of data using that character. In that case, none of the
problems mentioned above arises.

Of course, in between there is a range of scenarios involving varying
degrees of data interchange. The risks will vary depending on the
scenario. Anyone considering use of PUA in such cases should evaluate
the potential risks and costs of their options.



Peter




Re: more flexible pipeline for new scripts and characters

2011-11-20 Thread Doug Ewell
I agree completely: there are some situations, or combinations of 
situations, where using the PUA and interchanging PUA data will 
certainly cause problems, and some where it will not, and in fact can be 
quite helpful.


However, the explanation Peter laid out is often not stated publicly;
it gets boiled down to simplistic edicts like "Avoid the PUA."  Many
examples can be found, on this list and elsewhere.


--
Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14
www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell



Re: more flexible pipeline for new scripts and characters

2011-11-18 Thread Karl Williamson

On 11/16/2011 07:25 AM, Asmus Freytag wrote:

[quoted message elided; Asmus's full 2011-11-16 message appears later in
this thread]


How is this different from Named sequences, which are published 
provisionally?




Re: more flexible pipeline for new scripts and characters

2011-11-18 Thread Ken Whistler

On 11/18/2011 1:30 PM, Karl Williamson wrote:
How is this different from Named sequences, which are published 
provisionally?


Named sequences aren't character properties.

When a newly encoded character is published in the standard, its code
point, its name, and dozens of other properties all have to be published
at the same time. The whole notion of omitting any of them would cause
problems for implementers and would be tantamount to saying that the
character isn't actually standard yet, because properties for it are
missing.

And for good reasons, *some* (but not all) of those properties are also
immutable upon publication. The most obvious is the code point, of
course. Changing a code point for an encoded character after it is
published in the standard is tantamount to admitting it was never
standard in the first place.

In the early days of Unicode (and 10646, for that matter), the committees
entertained the notion that character names might be the kind of thing
which could occasionally get corrected later, as needed, after
publication. But after several notorious examples of the undesirability
and costs associated with changing character names after publication,
the committees slammed the door on that, and character *names* are now
as immutable as their code points.


Named sequences are different. Publishing a newly encoded character has
no implications whatsoever for named sequences. A named sequence stands
on its own, as an independent entity. Furthermore, there basically are
no algorithms (or implementations) that depend on them in any
significant way. Named sequences are primarily epicycles of the
character encoding process -- they give standard names to things that
people want to have names for, but which the committees decline to
encode as characters, because they can already be represented by
sequences of existing characters.

Given that status, and given that named sequences are *not* character
properties, it was possible to create a two-staged, provisional
publication mechanism for them, publishing them first as a provisional
list, and then later, if nobody has any objections or corrections,
moving them into the (immutable) standard list.

You just can't do that with character *names*.
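The two lists Ken describes are published as NamedSequences.txt and
NamedSequencesProv.txt in the Unicode Character Database, in a simple
name;code-points format. A minimal parser sketch (the example entry is
from memory and worth verifying against the current data file):

```python
# Parse one data line in the NamedSequences.txt format:
# "SEQUENCE NAME;<space-separated hex code points>".
def parse_named_sequence(line: str):
    """Return (name, sequence-as-string) for one data line."""
    name, _, points = line.partition(';')
    return name.strip(), ''.join(chr(int(cp, 16)) for cp in points.split())

# Example: the Tamil named sequences are in the standardized list.
name, seq = parse_named_sequence('TAMIL CONSONANT K;0B95 0BCD')
```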

If you want to make analogies, however, the ISO ballots constitute the
*provisional* publication for character code points and names. If
nobody has any objections or corrections expressed during the balloting
process (which can continue for 2 years or longer), then eventually
those code points and names get moved into the (immutable) list in the
standard.

--Ken




Re: more flexible pipeline for new scripts and characters

2011-11-18 Thread Asmus Freytag

On 11/18/2011 1:30 PM, Karl Williamson wrote:

On 11/16/2011 07:25 AM, Asmus Freytag wrote:



The whole reason that some aspects of character encoding are "write
once" (can never be changed) is to prevent such obsolete data in
documents.






How is this different from Named sequences, which are published 
provisionally?




Named sequences are a special case.

The sequence as such exists, whether or not a name is defined for it.

Therefore, ordinary users can go about their business creating documents 
containing character sequences without needing to know whether a 
sequence is named or not.


Those users (programmers) that use these names in place of identifiers
can be expected to understand what "provisional" means and to be aware
of the penalties for implementing them in ways that can't later be
upgraded.

Perl should probably not support them in regex notation, for example.

So, in all respects, these act more like ordinary properties, for which
provisional information is already supported in the UCD (mostly for
Unihan).


A./



Re: more flexible pipeline for new scripts and characters

2011-11-18 Thread Asmus Freytag

On 11/18/2011 3:06 PM, Ken Whistler wrote:

On 11/18/2011 1:30 PM, Karl Williamson wrote:
How is this different from Named sequences, which are published 
provisionally?


Named sequences aren't character properties.


They provide information about characters in context - in that sense
they are similar to many other properties, even if most of them can be
mapped to single character codes (with the contextual behavior left to
algorithms and rules).


That is not to detract from your main point, with which I fully agree:
this puts them into the realm of information that is not required for a
character to be minimally defined, and that need not be available from
day one for a character to be implementable at all (as decomposition
mappings, bidi class, code point, name, etc. must be).




If you want to make analogies, however, the ISO ballots constitute the
*provisional* publication for character code points and names. If
nobody has any objections or corrections expressed during the balloting
process (which can continue for 2 years or longer), then eventually
those code points and names get moved into the (immutable) list in the
standard.



Good point.

If it would be manageable, I would recommend for Unicode to have a 
public review process on its own for character proposals, so as to 
elicit broader public review before data is finalized for publication. 
In the Unicode process, there's a public beta, but that is useful only 
to spot mistakes in the publishing process - it's usually too late to 
fix substantial mistakes of any kind.


A./



Re: more flexible pipeline for new scripts and characters

2011-11-16 Thread Asmus Freytag

Peter,

in principle, the idea of a provisional status is a useful concept
whenever one wants to publish something based on potentially doubtful
or possibly incomplete information. And you are correct that, in
principle, such an approach could be most useful whenever there's no
possibility of correcting some decision taken in standardization.


Unicode knows the concept of a provisional property, which works roughly 
in the manner you suggested. However, for certain types of information 
to be standardized, in particular the code allocation and character 
names, it would be rather problematic to have extended provisional 
status. The reason is that once something is exposed in an 
implementation, it enables users to create documents. These documents 
would all have to be provisional, because they would become obsolete 
once a final (corrected or improved) code allocation were made.


The whole reason that some aspects of character encoding are "write
once" (can never be changed) is to prevent such obsolete data in
documents.


Therefore, the only practical way is that of having a bright line 
between proposed allocations (that are not implemented and are under 
discussion) and final, published allocations that anyone may use. 
Instead of a provisional status, the answer would seem to lie in making 
the details of proposed allocations more accessible for review during 
the period where they are under consideration and balloting in the 
standardization committee.


One possible way to do that would be to make repertoire additions 
subject to the Public Review process.


Another would be for more interested people to become members and to 
follow submissions as soon as they hit the Unicode document registry.


The former is much more labor-intensive and I suspect not something the
Consortium could easily manage with the existing funding and resources.
The latter would have the incidental benefit of adding to the funding
for the work of the Consortium via membership fees.


A./



Re: more flexible pipeline for new scripts and characters

2011-11-16 Thread Peter Cyrus
I guess what I'm proposing is that the proposed allocations be implemented,
so that problems may be unearthed, even as the users accept that the
standard is still only provisional.

On Wed, Nov 16, 2011 at 3:25 PM, Asmus Freytag asm...@ix.netcom.com wrote:

 [quoted message elided; Asmus's full 2011-11-16 message appears earlier
 in this thread]



Re: more flexible pipeline for new scripts and characters

2011-11-16 Thread Asmus Freytag

On 11/16/2011 6:37 AM, Peter Cyrus wrote:
I guess what I'm proposing is that the proposed allocations be 
implemented, so that problems may be unearthed, even as the users 
accept that the standard is still only provisional.


Where users are programmers, as is the case with certain properties,
such niceties are more or less understood by all parties involved.
Where users are the general public, as would be the case with
provisional implementations, you run into more issues.


Not many users are in the business of creating test data that can be
thrown away; most expect any implementation to be faithful (forever) to
their data. Second, absent a firm timeline in standardization (which
would prevent bad proposals from being held back indefinitely),
implementers would not know when they can move their provisional
implementations to final status for a given script.


Most implementations support more than a single script, which would mix 
provisional and non-provisional data.


Test implementations can be built any time, and whether you base them
on draft documents under ballot or provisional allocations under some
more formal scheme really makes no difference. (There's been a
long-standing suggestion that people test characters or scripts using
the private use area. This seems not to be favored, again, because all
data created under such a scheme become obsolete once a final encoding
comes out.)


What would make a difference would be the ability to have some scripts 
exist in a provisional state for really extended periods, to allow all 
sorts of issues to be discovered in realistic use. That, however, runs 
into the problem that users really tend to be impatient. Once 
functional implementations exist, they want to create real data.


So far, for the vast majority of characters, the existing system has
proven workable. There are a small number of mistakes that were
discovered too late to be fixed invisibly, leaving a trail of
deprecated characters or formal aliases for character names.

Overall, the number of these is rather small, given the sheer size of
Unicode, even if one or another recent example appears to warrant more
systematic action.


A./


RE: more flexible pipeline for new scripts and characters

2011-11-16 Thread Doug Ewell
Peter Cyrus pcyrus at alivox dot net wrote:

 In other words, people could propose a new script or character and
 rather than have it discussed before encoding and then encoded in
 permanence, with no possibility even to correct obvious errors as in
 U+FE18, instead it would be provisionally accepted but still subject
 to modifications as implementors worked with it.  Hopefully, most
 mistakes would be unearthed early and corrections applied before much
 text had been encoded.  As time passed and the encoding became more
 stable, the size of mistake open to correction would be reduced, e.g.
 to spelling errors, until it was frozen as a result of this process
 before being declared permanent.

As Asmus points out, users tend to want to jump the gun and start using
anything that appears to be even provisionally approved.  Look at all
the health warnings that UTC has to include on the Pipeline and
beta-review pages.

 My thought is that some of the problems that I've seen discussed might
 have been discovered and addressed had a community been using the
 proposed standard before it became immutable.  In the current process,
 that transition may occur too early to be useful.  It may be easier to
 fix all the existing text if very little time has passed, than to
 fix all future text forever.

Spelling errors like BRAKCET and name errors like LATIN LETTER OI really
don't tend to matter much, in the real world.  We do talk about them a
lot.
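The BRAKCET case also shows how such errors are handled once frozen: the
misspelled name of U+FE18 is immutable, and the correction is published
as a formal name alias. A quick check with Python's unicodedata module
(assuming Python 3.3+, which resolves aliases from NameAliases.txt):

```python
import unicodedata

# The original, misspelled name of U+FE18 is frozen in the standard:
misspelled = unicodedata.name('\uFE18')
assert misspelled.endswith('BRAKCET')

# The correction exists only as a formal name alias, which lookup()
# resolves alongside regular character names:
assert unicodedata.lookup(
    'PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET'
) == '\uFE18'
```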

 This idea could also be extended to new characters and scripts that
 might or might not make it into Unicode : Unicode could offer a
 provisional acceptance that allowed users to demonstrate the utility
 of the proposed changes once they're in Unicode, even if they're later
 modified or withdrawn.

This is one of the things the PUA is for.  Unfortunately, it has become
very popular to tell people to stay away from the PUA, that it is evil
and unsuitable for any sort of interchange, and so people tend to look
for alternative solutions which shouldn't be necessary.

 This policy might have prevented the recoding of Tengwar, Cirth,
 Shavian, Phaistos Disc and Deseret as they moved from the PUA to the
 SMP.

The PUA is a kind of sandbox for encoding experimentation.  For exactly
the reasons you give elsewhere, there was no guarantee that Shavian and
Phaistos Disc and Deseret would be encoded exactly as they were found in
the ConScript Unicode Registry -- indeed, the layout of the Deseret
block was different.  They would have had to be recoded anyway.  The
same is true for Tengwar and Cirth (which, by the way, have not been
approved or even reconsidered recently).
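The recoding Doug mentions amounts to mapping each provisional PUA code
point to its final assignment. A sketch of such a migration (the PUA
base below is purely hypothetical and the registry assignment should be
checked; the final Shavian block genuinely starts at U+10450):

```python
# Recode text from a provisional PUA encoding to final code points.
PUA_BASE = 0xE700      # hypothetical provisional location of the script
FINAL_BASE = 0x10450   # Shavian block in the SMP
LETTERS = 48           # Shavian's 48 letters

recode = {PUA_BASE + i: FINAL_BASE + i for i in range(LETTERS)}

def migrate(text: str) -> str:
    """Map provisionally encoded characters to their final code points."""
    return text.translate(recode)
```

Note that a fixed offset only works when the block layout is preserved;
since layouts did change (as for Deseret), a real migration needs a
per-character table.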

--
Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14
www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell