Re: Concise term for non-ASCII Unicode characters

Sean Leonard Tue, 29 Sep 2015 09:51:09 -0700

On 9/21/2015 5:17 PM, Peter Constable wrote:

If you think it's a serious problem that there isn't one conventional term for "characters outside the ASCII repertoire" or "UTF-8 multi-code-unit encoded representations" (since different authors could devise different terminology solutions), then I suggest you submit a document to UTC explaining why it's a problem, documenting inconsistent or unclear terminology that's been used in some standards / public specifications, and requesting that Unicode formally define terminology for these concepts. I can't guarantee that UTC will do it, but I can predict with confidence that it _won't_ do anything of that nature if nobody submits such a document.

Peter

I am of the mind to do just that, then. I have seen different documents, standards, and standards bodies that have invented terminology around this concept, and the terms are not always the same. Since these standards depend on Unicode, it would make a lot of sense for Unicode formally to define terminology for these concepts. With the proliferation of UTF-8 (among other things), the boundary between 0x7F and 0x80 is more significant than the boundary between 0xFFFF and 0x10000.
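To illustrate the two boundaries, here is a minimal Python sketch that counts code units on either side of each one. The helper names utf8_length and utf16_length are made up for this illustration and come from no standard or library; only the built-in str.encode is used.

```python
def utf8_length(cp: int) -> int:
    """Number of UTF-8 code units (bytes) needed for code point cp."""
    return len(chr(cp).encode("utf-8"))


def utf16_length(cp: int) -> int:
    """Number of UTF-16 code units (16-bit words) needed for code point cp."""
    return len(chr(cp).encode("utf-16-le")) // 2


# Code points on either side of the two boundaries discussed above.
for cp in (0x7F, 0x80, 0xFFFF, 0x10000):
    print(f"U+{cp:04X}: UTF-8 units = {utf8_length(cp)}, "
          f"UTF-16 units = {utf16_length(cp)}")
```

In UTF-8 the count goes from one to two code units between 0x7F and 0x80, while in UTF-16 nothing changes until the step from 0xFFFF to 0x10000. Any software that handles UTF-8 as bytes therefore meets the first boundary as soon as text leaves the ASCII range.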

Since this will be my first submission I would appreciate a co-author on this topic. Is anyone willing to help? Thanks in advance. Also, it is not clear if such a document is destined to become a Unicode Technical Report (UTR / PDUTR etc.), or if it should just be an informal write-up. I am guessing this is supposed to be somewhat informal but at the same time it (or the results of it) ought to appear in the UTC Document Search.

The current terminology that I am considering pursuing is "beyond ASCII", in various permutations, such as "beyond the ASCII range", "characters beyond ASCII", "code points beyond ASCII", etc. The term "beyond" implies a certain directionality, and to that extent, implies the Unicode repertoire as well as a Unicode encoding. We have seen on this list the backflips required to clarify "non-ASCII", since things that are not ASCII literally could be a wide range of things.

I think there is some confusion about whether the term "Basic Latin" excludes the C0 control character range. Formally the standard seems clear enough to me that it is coterminous with ASCII, but there is still confusion if you don't pore through the Standard. My thought is that maybe the Blocks.txt data should be modified to say "ASCII (Basic Latin)" instead of just "Basic Latin". (If we "go there", I would appreciate the wisdom of an experienced Unicode co-author. I am not confident touching that just by myself.)
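As a rough check of the point about control characters, the Python sketch below assumes the Basic Latin block spans U+0000..U+007F (the range listed in Blocks.txt) and uses the standard unicodedata module to count the control characters in that range. The names BASIC_LATIN, controls, and c0_controls are illustrative only.

```python
import unicodedata

# Assumed block range, as listed in Blocks.txt: 0000..007F; Basic Latin
BASIC_LATIN = range(0x0000, 0x0080)

# Code points in the block whose General_Category is Cc (control).
controls = [cp for cp in BASIC_LATIN if unicodedata.category(chr(cp)) == "Cc"]

# The C0 controls proper are U+0000..U+001F; U+007F (DEL) is also Cc.
c0_controls = [cp for cp in controls if cp <= 0x1F]

print(f"Control characters in Basic Latin: {len(controls)}")
print(f"  of which C0 (U+0000..U+001F):    {len(c0_controls)}")
print(f"  DEL (U+007F) included:           {0x7F in controls}")
```

Under that assumption the block holds all 32 C0 controls plus DEL, the same 128 code points as ASCII.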


Sean
