Re: What's the BMP being saved for?

Asmus Freytag Fri, 19 Mar 2004 15:03:45 -0800

At 07:13 AM 3/19/2004, Marion Gunn wrote:

Ar 15:33 +0000 2004/03/18, scríobh Arcane Jill:
>This probably is going to sound like a really dumb question, but ... Is
>the BMP being saved for something?
>...
>Arcane Jill

There are never any dumb questions, Jill, only dumb answers.

And some of the latter deserve to be straightened out a bit.

BMP is part of 10646-speak, and probably part of pre-Unicode terminology.


It used to be, but now this term is mentioned on page 1 of Unicode 4.0.
Michael's reply to this had it partially right:

At 10:57 AM 3/19/2004, Michael Everson wrote:

This is incorrect. "BMP" means "Basic Multilingual Plane" and is the name given to the plane designated by the code positions 00000-0FFFF. It is not "10646-speak". It is part of the architectural nomenclature of the Universal Character Set.

Yes, and all that is 10646-speak, in the sense that BMP, UCS etc. are terms from 10646. While it's correct to call The Unicode Standard a universal character set, the title Universal Character Set is that of 10646.

To summarize (telescoping time) so as to get this msg off before returning
to paid work. :-)

The decision to create the BMP dates back to a time when certain software
suppliers were complaining that anything approaching a full implementation
of ISO 10646 (later transmuted, so to speak, into Unicode) would be too big
for them to handle, and too costly.

The fact that there is a 2-byte form (UCS-2) of 10646 is due to the merger with Unicode, which was conceived of as a 16-bit standard. Before the merger with Unicode, there were 1-byte and 3-byte forms as well, and the 2-byte form was quite a bit different from today's BMP in its basic layout and behavior. For example, vast sections of it could be 'swapped' out to effectively create a C, J, K and U version of the 2-byte form.

The Unicode camp felt that asking the world to support a 32-bit standard to replace the hodgepodge of 8-bit character sets, etc., was asking too much. Their initial belief that one could actually contain a universal character set in 16 bits had begun to crack around that time, as can be witnessed by the creation of UTFs (first UTF-8 and precursors, then UTF-16).

10646 was simplified to a static 2-byte and 4-byte form. Later, UTF-8 and the surrogates needed for UTF-16 were added, leaving both standards with 3 equivalent encoding forms, plus the fixed-width 16-bit UCS-2, found in 10646 only, which is not so useful.
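
To make the "3 equivalent encoding forms" concrete, here is a minimal Python sketch. The helper names (to_surrogates, encoding_forms) and the sample character U+20000 are made up for illustration; the surrogate arithmetic follows the UTF-16 definition as I understand it, and the byte encodings come from Python's built-in codecs.

```python
def to_surrogates(code_point: int) -> tuple:
    """Split a supplementary code point into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points beyond the BMP need surrogates")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def encoding_forms(ch: str) -> dict:
    """Return one character as UTF-8, UTF-16 and UTF-32 bytes (big-endian, no BOM)."""
    return {
        "UTF-8": ch.encode("utf-8"),
        "UTF-16": ch.encode("utf-16-be"),
        "UTF-32": ch.encode("utf-32-be"),
    }


if __name__ == "__main__":
    sample = "\U00020000"  # arbitrary example of a character outside the BMP
    for name, data in encoding_forms(sample).items():
        print(name, data.hex())
    print([hex(unit) for unit in to_surrogates(ord(sample))])
```

All three byte sequences decode back to the same single character; only the UTF-16 form needs the surrogate pair, and a fixed-width UCS-2 has no way to express it at all.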

Small local groups, such as ours, were then working rapidly and painlessly
mostly on national and international character sets on far smaller scales.

I recall chairing some discussion at a CEN workshop, possibly in Slovenia,
in re something related, at the height of the debate. In any case, by that
time, CEN had already emerged as a big player in this work (I think Unicode
had yet to make much of a mark, but I don't mind if someone corrects me
about that, if wrong, because it really doesn't matter now, in the least).

Anyway, it was agreed to divide ISO 10646 into sections, such as BMP (Basic
Multilingual Plane) and the MES (Minimum European Subset), and my own
company, among others, was very pleased to be hired by CEN to do the
necessary (a truly exciting and rewarding period, when we actually got
_paid_, generously, if belatedly, for such Standards work!)

BMP and MES are certainly both sub-sets, but they are not on equal footing. One is a contiguous sub-set of the code space, lined up with an even power of 2, the other is a discontiguous sub-set of the *characters* in 10646 determined by rather unclear principles to be of use to Europeans.
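
Because the BMP is a contiguous block aligned on a power of 2, membership is a matter of simple arithmetic, whereas a collection like the MES needs an explicit list of characters. A rough Python sketch, where the names plane_of and in_bmp are hypothetical and the upper limit 0x10FFFF is assumed to be the range reachable via UTF-16:

```python
MAX_CODE_POINT = 0x10FFFF  # assumed upper limit: what UTF-16 can address


def plane_of(code_point: int) -> int:
    """Return the plane number: each plane spans 2**16 code points."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("code point out of range")
    return code_point >> 16


def in_bmp(code_point: int) -> bool:
    """True if the code point lies in Plane 0, the BMP."""
    return plane_of(code_point) == 0


if __name__ == "__main__":
    for cp in (0x0041, 0x1680, 0xFFFF, 0x10000, 0x20000):
        print(f"U+{cp:04X}: plane {plane_of(cp)}, in BMP: {in_bmp(cp)}")
```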

Is the BMP a reality, actually referenced in software, or scheduled to be
so referenced in future? I doubt it, although I think that would be a very
good thing (just as I believe the 8859 series and the like more practically
useful, even today, as clean-cutting tools, than the full complement of
10646, which remains a rather blunt instrument which creates obstacles in
unflagged text).

The BMP is a handy concept, and a practical tool to organize code point allocations; that's why the term made it from 10646 into Unicode. It approximates the collection of frequently used characters from living scripts; there are some exceptions to this, viz. the Hong Kong ideographs in Plane 2 and Runic and Ogham on Plane 0.

What's not so useful is UCS-2. The best use I've found for that term is as a descriptive label on software that does not (yet) support supplementary characters; so I'm hoping that use of the term will gradually expire.
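
As a rough illustration of what that label implies, the Python sketch below checks whether a piece of text could be handled by software limited to UCS-2, i.e. whether it contains no supplementary characters. The function name fits_ucs2 is hypothetical, and the check deliberately ignores finer points such as stray surrogate code points.

```python
def fits_ucs2(text: str) -> bool:
    """True if every character is a BMP code point (no supplementary characters)."""
    # Python 3 strings are sequences of code points, so ord() gives the
    # code point directly rather than a UTF-16 code unit.
    return all(ord(ch) <= 0xFFFF for ch in text)


if __name__ == "__main__":
    print(fits_ucs2("Ogham: \u1681"))    # Ogham sits on Plane 0
    print(fits_ucs2("CJK: \U00020000"))  # a Plane 2 ideograph
```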

Justification for saving the BMP for the purposes originally intended is
probably something the Unicode Consortium would be happy to clarify for
you.

There've been some nice answers by others on the list who took the time to put them together.

Perhaps that has already been done in some of today's e-mails, which are
too numerous for me to read right now, under pressure of urgent work. (I do
promise to try to read them all.) If you want more info on the purpose and
genesis of the BMP,

A little known fact is that representatives of Unicode participated in the work on, and review of 10646 long before the merger. We have no need to study anyone else's archives ;-).

A./

I suggest that you ask NSAI to let you study the
archives of NSAI/AGITS/WG6 (later transmuted into NSAI/ICTSCC/SC4), or that you
send a simple query directly to CEN (on whose live agenda such matters
remain, I believe).

Hope this helps,
mg

ps.
Would someone just hit reply to this msg, to time our comms here? There
seems to be a long timelag between sending and delivery of Unicode list
msgs, sometimes.
mg


--
Marion Gunn * EGTeo (Estab.1991)
27 Páirc an Fhéithlinn, Baile an
Bhóthair, Co. Átha Cliath, Éire.
* [EMAIL PROTECTED] * [EMAIL PROTECTED] *
