On 9/29/2015 10:30 AM, Sean Leonard wrote:
On 9/29/2015 9:40 AM, Daniel Bünzli wrote:
I would say there's already enough terminology in the Unicode world without adding more to it. This thread has already hinted at enough ways of expressing what you'd like, the simplest one being "scalar values greater than U+001F". This is the clearest you can come up with, and anybody who has basic knowledge of the Unicode standard will understand it.
Uh...I think you mean U+007F? :)

I agree that "scalar values greater than U+007F" doesn't just trip off the tongue, and while technically accurate, it is bad terminology -- precisely because it
begs the question "wtf are 'scalar values'?!" for the average engineer.


Perhaps it's because I'm writing to the Unicode crowd, but honestly there are a lot of very intelligent software engineers/standards folks who do not have the "basic knowledge of the Unicode standard" that is being presumed. They want to focus on other parts of their systems or protocols, and when it comes to the "text part", they just hand-wave and say "Unicode!" and call it a day. ...

Well, from this discussion, and from my experience as an engineer, I think this comes down to people in other standards, practices, and protocols dealing with the ages-old problem of "on beyond zebra" for characters, where the comfortable assumption that byte = character breaks down and people have to special-case their code and documentation. Where buffers overrun, where black-hat hackers rub their hands in glee, and where engineers exclaim, "Oh gawd! I can't just cast this character, because it's actually an array!"

And nowadays, we are in the age of universal Unicode. All (well, much, anyway) would be cool if everybody were using UTF-32, because then at least we'd be back to 32-bit word = character, and the programming would be easier. But UTF-32 doesn't play well with existing protocols and APIs and storage and... So instead, we are in the age of "universal Unicode and almost always UTF-8."

So that leaves us with two types of characters:

1. "Good characters"

These are true ASCII. U+0000..U+007F. Good because they are all single bytes in UTF-8 and because then UTF-8 strings just work like the Computer Science God always intended,
and we don't have to do anything special.

2. "Bad characters"

Everything else: U+0080..U+10FFFF. Bad because they require multiple bytes to represent in UTF-8, and so break all the simple assumptions about string and buffer length. They make for bugs and more bugs, and why oh why do I have to keep dealing with edge cases where character boundaries don't line up with allocated buffer boundaries?!! (A short sketch just after this list shows the byte-count difference.)
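To make that concrete, here is a minimal sketch -- Python, purely illustrative; the string literals and the utf8_length helper are just invented for this message -- of how the byte = character assumption holds for set #1 and falls apart for set #2:

    def utf8_length(s: str) -> int:
        """Number of bytes needed to store s in UTF-8."""
        return len(s.encode("utf-8"))

    good = "Unicode!"    # every scalar value is <= U+007F ("good characters")
    bad = "Ünïcødé…"     # contains scalar values > U+007F ("bad characters")

    # For the "good" set, byte count == character count, so the
    # comfortable old assumption still holds.
    assert len(good) == 8 and utf8_length(good) == 8

    # For the "bad" set, each such character needs 2..4 bytes in UTF-8,
    # so any buffer math based on character counts goes wrong.
    assert len(bad) == 8 and utf8_length(bad) == 14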

I think we can agree that there are two types of characters -- and that those code point
ranges correctly identify the sets in question.

The problem then just becomes a matter of terminology (in the standards sense of "terminology") -- coming up with usable, clear terms for the two sets. To be good terminology, the terms have to be identifiable, and neither too generic ("good characters" and "bad characters") nor too abstruse or wordy ("scalar values less than or equal to U+007F" and "scalar values greater than U+007F").

They also need to not be confusing. For example, "single-byte UTF-8" and "multi-byte UTF-8" might work for engineers, but the distinction is confusing: UTF-8 as an encoding form is inherently multi-byte, and such terminology would undermine the meaning of UTF-8 itself.

Finally, to be good terminology, the terms need to have some reasonable chance of catching on and actually being used. It is fairly pointless to have a "standardized way" of distinguishing the #1 and #2 types of characters if people either don't know about that standardized way or find it misleading or unhelpful, and instead keep groping about with their existing ad hoc terms anyway.


In the twenty minutes since my last post, I got two different responses... and as you pointed out, there are a lot of ways to express what one would like. I would prefer one uniform way (hence, "standardized way").

Mark's point was that it is hard to improve on what we already have:

1. ASCII Unicode [characters] (i.e. U+0000..U+007F)

2. Non-ASCII Unicode [characters] (i.e. U+0080..U+10FFFF)

If we just highlight that terminology more prominently, emphasize it in the
Unicode glossary, and promote it relentlessly, it might catch on more generally,
and solve the problem.
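As a small aside on whether that could catch on: at least one mainstream language already exposes exactly the #1 vs. #2 split under the "ASCII" name. A minimal illustration (Python 3.7 or later, where str.isascii() exists):

    print("café".isascii())   # False: 'é' is U+00E9, a non-ASCII Unicode character
    print("cafe".isascii())   # True: every character is in U+0000..U+007F

Which suggests, at least anecdotally, that the "ASCII / non-ASCII Unicode characters" framing is already the term of art many engineers reach for.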

More irreverently, perhaps we could come up with complete neologisms that
might be catchy enough to go viral -- at least among the protocol writers and engineers who matter for this. Riffing on the small/big distinction and connecting
it to "u-*nichar*" for the engineers, maybe something along the lines of:

1. skinnichar

2. baloonichar

Well, maybe not those! But you get the idea. I'm sure there is a budding terminologist
out there who could improve on that suggestion!

At any rate, any formal contribution that suggests coming up with terminology for the #1 and #2 sets should take these considerations under advisement. And unless it suggests something that would pretty easily gain consensus as demonstrably better than
the #1 and #2 terms suggested above by Mark, it might not result in any
change in actual usage.

--Ken


