On 9/21/2015 5:17 PM, Peter Constable wrote:
If you think it's a serious problem that there isn't one conventional term for "characters outside the ASCII repertoire" or "UTF-8 multi-code-unit encoded representations" (since different authors could devise different terminology solutions), then I suggest you submit a document to UTC explaining why it's a problem, documenting inconsistent or unclear terminology that's been used in some standards / public specifications, and requesting that Unicode formally define terminology for these concepts. I can't guarantee that UTC will do it, but I can predict with confidence that it _won't_ do anything of that nature if nobody submits such a document. Peter

I am of a mind to do just that, then. I have seen different documents, standards, and standards bodies invent terminology for this concept, and the terms are not always the same. Since these standards depend on Unicode, it would make a lot of sense for Unicode to define terminology for these concepts formally. With the proliferation of UTF-8 (among other things), the boundary between 0x7F and 0x80 is more significant than the boundary between 0xFFFF and 0x10000.
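To make the boundary point concrete, here is a minimal Python sketch. The helper names (utf8_len, utf16_units, is_beyond_ascii) are my own and purely illustrative; they are not terms or APIs from the Standard. It counts the code units needed for a code point on either side of each boundary.

```python
# Illustrative sketch only: helper names are hypothetical, not Unicode terminology.

def utf8_len(cp: int) -> int:
    """Number of UTF-8 code units (bytes) needed to encode one code point."""
    return len(chr(cp).encode("utf-8"))

def utf16_units(cp: int) -> int:
    """Number of UTF-16 code units (16-bit) needed to encode one code point."""
    return len(chr(cp).encode("utf-16-le")) // 2

def is_beyond_ascii(cp: int) -> bool:
    """True for code points above the ASCII range (U+0000..U+007F)."""
    return cp > 0x7F

# Code points on either side of the two boundaries discussed above.
for cp in (0x7F, 0x80, 0xFFFF, 0x10000):
    print(f"U+{cp:04X}: UTF-8 units={utf8_len(cp)}, "
          f"UTF-16 units={utf16_units(cp)}, beyond ASCII={is_beyond_ascii(cp)}")
```

In UTF-8 the length changes from one code unit to two when crossing from 0x7F to 0x80, whereas crossing from 0xFFFF to 0x10000 changes the UTF-16 count from one to two (and UTF-8 from three to four). Any software that handles UTF-8 runs into the first boundary constantly.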

Since this will be my first submission, I would appreciate a co-author on this topic. Is anyone willing to help? Thanks in advance. Also, it is not clear to me whether such a document is destined to become a Unicode Technical Report (UTR / PDUTR etc.) or should just be an informal write-up. I am guessing it is supposed to be somewhat informal, but at the same time it (or the results of it) ought to appear in the UTC Document Search.

The terminology that I am currently considering is "beyond ASCII", in various permutations, such as "beyond the ASCII range", "characters beyond ASCII", "code points beyond ASCII", etc. The term "beyond" implies a certain directionality, and to that extent implies the Unicode repertoire as well as a Unicode encoding. We have seen on this list the backflips required to clarify "non-ASCII", since things that are literally not ASCII could be a wide range of things.

I think there is some confusion about whether the term "Basic Latin" excludes the C0 control character range. Formally, the Standard seems clear enough to me that the block is coterminous with ASCII, but there is still confusion if you don't pore through the Standard. My thought is that maybe the Blocks.txt data should be modified to say "ASCII (Basic Latin)" instead of just "Basic Latin". (If we "go there", I would appreciate the wisdom of an experienced Unicode co-author. I am not confident touching that by myself.)

Sean
