From: "Asmus Freytag" <[EMAIL PROTECTED]>
> At 08:03 AM 10/16/03 -0700, Peter Kirk wrote:
> >Or perhaps a way can be found to graciously retire UTF-16 in some distant
> >future version of Unicode. That is likely to become viable long before the
> >extra planes are needed.
>
> This discussion is a pure numbers game. Since no-one can define a hard
> number for a cut-off that's guaranteed to be good 'forever', all we have is
> probability. (That's all we have anyway, whether in life or science). So
> the question becomes an estimate of probability.
>
> 128 characters (ASCII) cover 80% of the characters needed by 5% of the
> world's population
> 256 characters (Latin-1) covers 80% of the characters needed by 15% of the
> world's population
> 40,000 characters (Unicode 1.0) covers 95% of the characters needed by 85%
> of the world's population
> 90,000 characters (Unicode 4.0) covers 98% of the characters needed by 95%
> of the world's population
>
> Exercise for the reader:
>
> Warmup:
> Where do the other 910,000 characters come from, and who's using them?

We are not discussing the addition of characters standardized jointly by Unicode's UTC and ISO's WG2, and I do not expect many changes in that area. The question is rather about a more general scheme in which Unicode/ISO10646 would become part of a larger set of standards for encoding something other than pure text.

There are already attempts to encode attributed text, and to mix or interleave text and object data within a unified encoding scheme. The inclusion of codepoints like the Object Replacement Character already shows that mixing text and other data in a single unified, serialized stream is an issue. Of course there is now XML to add structure to such content, but unstructured data also has its applications, wherever a predefined schema cannot be designed.

There is also a need to let designers of glyph libraries encode and exchange them using privately allocated codepoints, without risking collisions between PUA assignments. As PUA characters are not designed to be interchanged, the alternative could be private reservation in a global registry, similar to reservation in the IPv4 address space. Codepoint usage can then be privately agreed upon between collaborating companies that wish to unify their own codesets and reduce their assignments (a process similar to IP space aggregation and renumbering, which has some technical issues but is solvable in the medium term).

In fact, this interchangeability of PUA codepoints is still an unsolved issue, one that could be solved in a way similar to IPv4 assignments under the IANA authority. Nothing needs to change for the current 17 planes managed by and assigned to Unicode/ISO10646, as long as UTC and WG2 accept that they will not need to centrally manage all character assignments for every limited group.

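To make the registry idea above a little more concrete, here is a minimal sketch of how non-colliding reservations in the private use ranges might be handed out. The three ranges are the ones Unicode defines for private use; everything else (the PuaRegistry class, its method names, the first-fit allocation policy, the holder names) is hypothetical and only meant to illustrate the principle, not an existing or proposed IANA procedure.

```python
from dataclasses import dataclass, field

# Private Use Areas defined by Unicode (inclusive ranges).
PUA_RANGES = [
    (0xE000, 0xF8FF),      # BMP Private Use Area
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A (plane 15)
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B (plane 16)
]


@dataclass
class PuaRegistry:
    """Hypothetical registry handing out non-overlapping PUA blocks."""

    # holder name -> (first codepoint, last codepoint), inclusive
    allocations: dict = field(default_factory=dict)
    # next unallocated codepoint in each PUA range (simple first-fit policy)
    _next_free: list = field(default_factory=lambda: [lo for lo, _ in PUA_RANGES])

    def reserve(self, holder: str, size: int) -> tuple:
        """Reserve `size` contiguous codepoints for `holder`."""
        if size <= 0:
            raise ValueError("size must be positive")
        if holder in self.allocations:
            raise ValueError(f"{holder} already holds a block")
        for i, (_, hi) in enumerate(PUA_RANGES):
            first = self._next_free[i]
            last = first + size - 1
            if last <= hi:
                self._next_free[i] = last + 1
                self.allocations[holder] = (first, last)
                return first, last
        raise RuntimeError("no PUA range has enough free codepoints")

    def owner_of(self, codepoint: int):
        """Return the holder whose block contains `codepoint`, or None."""
        for holder, (first, last) in self.allocations.items():
            if first <= codepoint <= last:
                return holder
        return None


if __name__ == "__main__":
    registry = PuaRegistry()
    for name, size in [("vendor-a", 256), ("vendor-b", 256), ("vendor-c", 10000)]:
        first, last = registry.reserve(name, size)
        print(f"{name}: U+{first:04X}..U+{last:04X}")
```

Because every block comes from a single allocator, two holders can never receive overlapping codepoints, which is exactly the property that uncoordinated PUA use lacks. Aggregation and renumbering, as with IP space, are left out of this sketch.
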
Because PUA interchange is unsolved, there is a big risk that PUAs become permanently assigned as part of an OS core charset, and that data created on different systems becomes mutually incompatible because the systems use colliding subsets of the PUA (this is already the case in core fonts and script processors used in MS Windows, and for a few private characters/logographs used by Apple in MacOS).

There is a huge number of candidate corporate logographs that could be reserved simply for use within a unified scheme that includes Unicode, and that could be negotiated within an IANA registry, with a reservation system similar to domain names. In addition, such a system could generate some revenue to help finance Unicode and ISO10646 activities: these private assignments remain interchangeable as long as their registration is active in the registry.

We could even imagine implementing this system within a special domain, using reverse DNS (rDNS) requests to resolve an assigned codepoint to a domain name: this domain could then hold information on how to get the glyphs, fonts, or other data supporting this private codepoint. These glyphs could be protected by digital rights or privacy measures, and could even include registered logos, graphics, designs, ... and even colorful photographs and artworks. I could imagine a lot of other similar applications...

This does not contradict the Unicode/ISO10646 goal, which is to keep the 17 planes open to everybody's use and publicly accessible for global interchange of information, under a strict policy describing the correct usage of the codepoints assigned and unified by ISO's WG2 and Unicode.org's UTC.
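As a rough illustration of the rDNS idea mentioned above, a codepoint could be mapped to a lookup name the same way reverse DNS zones do it: write the number in hex and reverse the digits, one label per digit. The zone name "pua.example" and both function names are assumptions made up for this sketch; no such zone exists, and the code only builds and parses names without sending any DNS query.

```python
def codepoint_to_rdns_name(codepoint: int, zone: str = "pua.example") -> str:
    """Build a reverse-lookup name for a codepoint (hypothetical scheme).

    The codepoint is written as six hex digits, and the digits are reversed
    so that the most significant digit sits closest to the zone, as in
    nibble-based reverse DNS zones.
    """
    if not 0 <= codepoint <= 0x10FFFF:
        raise ValueError("not a Unicode codepoint")
    digits = f"{codepoint:06x}"
    return ".".join(reversed(digits)) + "." + zone


def rdns_name_to_codepoint(name: str, zone: str = "pua.example") -> int:
    """Inverse of codepoint_to_rdns_name: recover the codepoint from a name."""
    suffix = "." + zone
    if not name.endswith(suffix):
        raise ValueError("name is not inside the expected zone")
    labels = name[: -len(suffix)].split(".")
    if len(labels) != 6 or any(len(label) != 1 for label in labels):
        raise ValueError("expected six single-digit labels")
    return int("".join(reversed(labels)), 16)


if __name__ == "__main__":
    name = codepoint_to_rdns_name(0xF0001)
    print(name)
    print(f"U+{rdns_name_to_codepoint(name):04X}")
```

Reversing the digits means a holder of a contiguous, aligned block of codepoints could be delegated a single subdomain, which is the same reason reverse DNS for IP addresses is laid out this way. A resolver would then ask that name for whatever record the registry chose to publish (for example, a pointer to the owner's domain).
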

