On 9/29/2015 9:40 AM, Daniel Bünzli wrote:
> I would say there's already enough terminology in the Unicode world to add more to it.
> This thread already hinted at enough ways of expressing what you'd like, the simplest one
> being "scalar values greater than U+001F". This is the clearest you can come up
> with and anybody who has basic knowledge of the Unicode standard
Uh...I think you mean U+007F? :)

Perhaps it's because I'm writing to the Unicode crowd, but honestly there are a lot of very intelligent software engineers/standards folks who do not have the "basic knowledge of the Unicode standard" that is being presumed. They want to focus on other parts of their systems or protocols, and when it comes to the "text part", they just hand-wave and say "Unicode!" and call it a day. In particular there is a flow-down effect where terms from one standards body don't match those of another standards body, perhaps because they got redefined over time for various reasons. The distinction between "characters", "abstract characters", "code points", and "scalar values" is not intuitively obvious to people without specialized knowledge of text processing issues. The fact that (modern implementations of) UTF-8 encoders and decoders are not supposed to process the surrogate code points (arbitrarily), for example, is a rather advanced topic that presumes knowledge of the interaction between UTF-16 and UTF-8, of what surrogate code points actually are, and of the security implications of doing so (UTR #36). Furthermore, one has to parse the distinction between "well-formed" and "ill-formed".
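To make the code point / scalar value distinction concrete, here is a rough Python sketch. The function names (is_code_point, is_scalar_value, is_non_ascii_scalar, utf8_accepts) are my own illustrative inventions, not terms or APIs from any standard; only the numeric ranges come from the Unicode definitions. It assumes Python 3, whose strict UTF-8 codec refuses to encode surrogates.

```python
# Surrogate code points: U+D800..U+DFFF (reserved for UTF-16 pairs).
SURROGATES = range(0xD800, 0xE000)

def is_code_point(n: int) -> bool:
    """Any integer in the Unicode codespace, U+0000..U+10FFFF."""
    return 0 <= n <= 0x10FFFF

def is_scalar_value(n: int) -> bool:
    """A code point that is not a surrogate."""
    return is_code_point(n) and n not in SURROGATES

def is_non_ascii_scalar(n: int) -> bool:
    """Hypothetical name for 'scalar values greater than U+007F'."""
    return is_scalar_value(n) and n > 0x7F

def utf8_accepts(n: int) -> bool:
    """Ask Python's strict UTF-8 encoder whether it will encode n."""
    try:
        chr(n).encode("utf-8")
        return True
    except ValueError:
        # chr() raises ValueError outside the codespace; the encoder raises
        # UnicodeEncodeError (a ValueError subclass) for surrogates.
        return False
```

Under those assumptions, utf8_accepts(n) agrees with is_scalar_value(n) at the boundaries I checked (U+D7FF, U+D800, U+DFFF, U+E000, U+10FFFF), which is exactly the "UTF-8 does not process surrogates" rule stated in code rather than in terminology.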

In the twenty minutes since my last post, I got two different responses...and as you pointed out, there are a lot of ways to express what one would like. I would prefer one uniform way (hence, "standardized way"). Just surveying the various standards that have tried to tackle this distinction with their own organic terminology will probably be revealing. Evidence should be the yardstick.

Best regards,

Sean
