Re: Concise term for non-ASCII Unicode characters

Sean Leonard Tue, 29 Sep 2015 10:37:40 -0700

On 9/29/2015 9:40 AM, Daniel Bünzli wrote:

I would say there's already enough terminology in the Unicode world to add more to it. 
This thread already hinted at enough ways of expressing what you'd like, the simplest one 
being "scalar values greater than U+001F". This is the clearest you can come up 
with and anybody who has basic knowledge of the Unicode standard

Uh...I think you mean U+007F? :)

Perhaps it's because I'm writing to the Unicode crowd, but honestly there are a lot of very intelligent software engineers/standards folks who do not have the "basic knowledge of the Unicode standard" that is being presumed. They want to focus on other parts of their systems or protocols, and when it comes to the "text part", they just hand-wave and say "Unicode!" and call it a day. In particular there is a flow-down effect where terms from one standards body don't match with another standards body, perhaps because they got redefined over time for various reasons. The distinction between "characters", "abstract characters", "code points", and "scalar values" is not intuitively obvious to people without specialized knowledge of text processing issues. The fact that (modern implementations of) UTF-8 encoders and decoders are not supposed to process the surrogate code points (arbitrarily), for example, is a rather advanced topic that presumes knowledge of the interaction between UTF-16, UTF-8, what surrogate code points actually are, and the security implications of so-doing (UTR-36). Furthermore one has to parse the distinction between "well-formed" and "ill-formed".
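
To make the code point / scalar value distinction concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, and the ranges are the usual ones (code points U+0000..U+10FFFF, surrogates U+D800..U+DFFF, ASCII ending at U+007F). It also shows that Python's strict UTF-8 encoder refuses a lone surrogate code point.

```python
# Illustrative helpers only; the names are hypothetical, not standard APIs.

def is_scalar_value(cp: int) -> bool:
    """True if cp is a Unicode scalar value: a code point that is not a surrogate."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def is_non_ascii_scalar(cp: int) -> bool:
    """True if cp is a scalar value greater than U+007F."""
    return cp > 0x7F and is_scalar_value(cp)


def can_encode_utf8(cp: int) -> bool:
    """True if Python's strict UTF-8 encoder accepts the code point cp.

    Assumes 0 <= cp <= 0x10FFFF, since chr() rejects anything outside that range.
    """
    try:
        chr(cp).encode("utf-8")
        return True
    except UnicodeEncodeError:
        # Raised for surrogate code points such as U+D800.
        return False


if __name__ == "__main__":
    for cp in (0x41, 0x7F, 0xFC, 0xD800, 0x1F600):
        print(f"U+{cp:04X}", is_non_ascii_scalar(cp), can_encode_utf8(cp))
```

U+D800 is a code point but not a scalar value, so both checks reject it, while U+00FC (the "ü" in Bünzli) passes both.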

In the twenty minutes since my last post, I got two different responses...and as you pointed out, there are a lot of ways to express what one would like. I would prefer one, uniform way (hence, "standardized way"). Just surveying the various standards that have tried to tackle this distinction with their own organic terminology will probably be revealing. Evidence-based should be the yardstick.


Best regards,

Sean
