On 9/29/2015 10:30 AM, Sean Leonard wrote:
On 9/29/2015 9:40 AM, Daniel Bünzli wrote:
I would say there's already enough terminology in the Unicode world without adding more to it. This thread has already hinted at enough ways of expressing what you'd like, the simplest one being "scalar values greater than U+001F". This is the clearest you can come up with, and anybody who has basic knowledge of the Unicode standard will understand it.
Uh...I think you mean U+007F? :)

I agree that "scalar values greater than U+007F" doesn't just trip off the tongue, and while technically accurate, it is bad terminology -- precisely because it
begs the question "wtf are 'scalar values'?!" for the average engineer.


Perhaps it's because I'm writing to the Unicode crowd, but honestly there are a lot of very intelligent software engineers/standards folks who do not have the "basic knowledge of the Unicode standard" that is being presumed. They want to focus on other parts of their systems or protocols, and when it comes to the "text part", they just hand-wave and say "Unicode!" and call it a day. ...

Well, from this discussion, and from my experience as an engineer, I think this comes down to people in other standards, practices, and protocols dealing with the ages-old problem of "on beyond zebra" for characters, where the comfortable assumption that byte = character breaks down and people have to special-case their code and documentation. Where buffers overrun, where black-hat hackers rub their hands in glee, and where engineers exclaim, "Oh gawd! I can't just cast this character, because it's actually an array!"

And nowadays, we are in the age of universal Unicode. All (well, much, anyway) would be cool if everybody were using UTF-32, because then at least we'd be back to 32-bit word = character, and the programming would be easier. But UTF-32 doesn't play well with existing protocols and APIs and storage and... So instead, we are in the age of "universal Unicode and almost always UTF-8."

So that leaves us with two types of characters:

1. "Good characters"

These are true ASCII. U+0000..U+007F. Good because they are all single bytes in UTF-8 and because then UTF-8 strings just work like the Computer Science God always intended,
and we don't have to do anything special.

2. "Bad characters"

Everything else: U+0080..U+10FFFF. Bad because they require multiple bytes to represent in UTF-8, and so break all the simple assumptions about string and buffer length. They make for bugs and more bugs, and why oh why do I have to keep dealing with edge cases where character boundaries don't line up with allocated buffer boundaries?!! (A short sketch just after this list shows the byte-count difference.)
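To make that concrete, here is a minimal sketch -- Python, purely illustrative; the string literals and the utf8_length helper are just invented for this message -- of how the byte = character assumption holds for set #1 and falls apart for set #2:

    def utf8_length(s: str) -> int:
        """Number of bytes needed to store s in UTF-8."""
        return len(s.encode("utf-8"))

    good = "Unicode!"    # every scalar value is <= U+007F ("good characters")
    bad = "Ünïcødé…"     # contains scalar values > U+007F ("bad characters")

    # For the "good" set, byte count == character count, so the
    # comfortable old assumption still holds.
    assert len(good) == 8 and utf8_length(good) == 8

    # For the "bad" set, each such character needs 2..4 bytes in UTF-8,
    # so any buffer math based on character counts goes wrong.
    assert len(bad) == 8 and utf8_length(bad) == 14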

I think we can agree that there are two types of characters -- and that those code point
ranges correctly identify the sets in question.

The problem then just becomes a matter of terminology (in the standards sense of "terminology") -- coming up with usable, clear terms for the two sets. To be good terminology, the terms have to be identifiable, and neither too generic ("good characters" and "bad characters") nor too abstruse or wordy ("scalar values less than or equal to U+007F" and "scalar values greater than U+007F").

They also need to not be confusing. For example, "single-byte UTF-8" and "multi-byte UTF-8" might work for engineers, but the distinction is confusing: UTF-8 as an encoding form is inherently multi-byte, and such terminology would undermine the meaning of UTF-8 itself.

Finally, to be good terminology, the terms need to have some reasonable chance of catching on and actually being used. It is fairly pointless to have a "standardized way" of distinguishing the #1 and #2 types of characters if people either don't know about that standardized way or find it misleading or unhelpful, and instead keep groping about with their existing ad hoc terms anyway.


In the twenty minutes since my last post, I got two different responses... and as you pointed out, there are a lot of ways to express what one would like. I would prefer one uniform way (hence, "standardized way").

Mark's point was that it is hard to improve on what we already have:

1. ASCII Unicode [characters] (i.e. U+0000..U+007F)

2. Non-ASCII Unicode [characters] (i.e. U+0080..U+10FFFF)

If we just highlight that terminology more prominently, emphasize it in the
Unicode glossary, and promote it relentlessly, it might catch on more generally,
and solve the problem.
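As a small aside on whether that could catch on: at least one mainstream language already exposes exactly the #1 vs. #2 split under the "ASCII" name. A minimal illustration (Python 3.7 or later, where str.isascii() exists):

    print("café".isascii())   # False: 'é' is U+00E9, a non-ASCII Unicode character
    print("cafe".isascii())   # True: every character is in U+0000..U+007F

Which suggests, at least anecdotally, that the "ASCII / non-ASCII Unicode characters" framing is already the term of art many engineers reach for.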

More irreverently, perhaps we could come up with complete neologisms that
might be catchy enough to go viral -- at least among the protocol writers and engineers who matter for this. Riffing on the small/big distinction and connecting
it to "u-*nichar*" for the engineers, maybe something along the lines of:

1. skinnichar

2. baloonichar

Well, maybe not those! But you get the idea. I'm sure there is a budding terminologist
out there who could improve on that suggestion!

At any rate, any formal contribution that suggests coming up with terminology for the #1 and #2 sets should take these considerations under advisement. And unless it suggests something that would pretty easily gain consensus as demonstrably better than
the #1 and #2 terms suggested above by Mark, it might not result in any
change in actual usage.

--Ken


