RE: Unicode "no-op" Character?

Sławomir Osipiuk via Unicode Wed, 03 Jul 2019 08:47:27 -0700

I’m frustrated at how badly you seem to be missing the point. There is nothing 
impossible or self-contradictory here. There is only the matter that Unicode 
requires all scalar values to be preserved during interchange. This is in many 
ways a good idea, and I don’t expect it to change, but something else would be 
possible if this requirement were explicitly dropped for a well-defined small 
subset of characters (even just one character). A modern-day SYN.


 

Let’s say it’s U+000F. The standard takes my proposal and makes it a 
discardable, null-displayable character. What does this mean?

 

U+000F may appear in any text. It has no (external) semantic value. But it may 
appear. It may appear a lot.

 

Display routines (which are already dealing with combining, ligaturing, 
non-/joiners, variations, initial/medial/final forms) understand that U+000F 
is to be processed as a no-op. Do nothing with this. Drop it. Move to the next 
character. Simple.

 

Security gateways filter it out completely, as a matter of best practice and 
security-in-depth.

 

A process, let’s call it Process W, adds a bunch of U+000F to a string it 
received, or built, or a user entered via keyboard. Maybe it’s to packetize. 
Maybe to mark every word that is an anagram of the name of a famous 
19th-century painter, or that represents a pizza topping. Maybe something else. 
This is a versatile character. Process W is done adding U+000F to the string. 
It stores it in a UTF-8 encoded database field. Encoding isn’t a problem. The 
database is happy.
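
As a rough illustration only, here is what Process W might look like in Python. The topping list, the function name, and the convention that the marker goes immediately before each marked word are all assumptions made up for this sketch; nothing in Unicode defines U+000F this way.

```python
NOOP = "\u000f"  # the hypothetical discardable character from this example

# Hypothetical word list; any private criterion would do.
TOPPINGS = {"olive", "mushroom", "pepperoni"}

def mark_toppings(text: str) -> str:
    """Prefix every word that names a pizza topping with NOOP."""
    words = text.split(" ")
    return " ".join(NOOP + w if w.lower() in TOPPINGS else w for w in words)

marked = mark_toppings("one olive and mushroom pie")

# Encoding is unremarkable: U+000F is a single byte (0x0F) in UTF-8.
stored = marked.encode("utf-8")
```

The stored bytes round-trip through UTF-8 like any other text, which is all the database needs.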

 

Now Process X runs. Process X is meant to work with Process W and it’s 
well aware of how U+000F is used. It reads the string from the database. It 
sees U+000F and interprets it. It chops the string into packets, or does a 
websearch for each famous painter, or it orders pizza. The private meaning of 
U+000F is known to both Process X and Process W. There is useful information 
encoded in-band, within a limited private context.
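
A minimal sketch of the packetizing variant of Process X, assuming Python and assuming the private agreement is "U+000F ends a packet". The function name and sample data are hypothetical.

```python
NOOP = "\u000f"  # the hypothetical discardable character from this example

def read_packets(stored: bytes) -> list[str]:
    """Decode the stored field and split it at the privately agreed marker."""
    text = stored.decode("utf-8")
    # Empty chunks (e.g. after a trailing marker) carry nothing, so drop them.
    return [chunk for chunk in text.split(NOOP) if chunk]

packets = read_packets("HELLO\u000fWORLD\u000f".encode("utf-8"))
```

This only works because X shares W's private context; no other process is expected to read the markers this way.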

 

But now we have Process Y. Process Y doesn’t care about packets or painters or 
pizza. Process Y runs outside of the private context that X and W had. Process 
Y translates strings into Morse code for transmission. As part of that, it 
replaces common words with abbreviations. Process Y doesn’t interpret U+000F. 
Why would it? It has no semantic value to Process Y.

 

Process Y reads the string from the database. Internally, it clears all 
instances of U+000F from the string. They’re just taking up space. They’re 
meaningless to Y. It compiles the Morse code sequence into an audio file.
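
Process Y's handling could be sketched like this, assuming Python. The Morse table is deliberately partial and the abbreviation step is left out; both are simplifications for illustration.

```python
NOOP = "\u000f"  # the hypothetical discardable character from this example

# Deliberately partial table, just enough for the sample below.
MORSE = {"S": "...", "O": "---", "E": ".", "T": "-"}

def to_morse(text: str) -> str:
    """Discard every NOOP, then translate what is left letter by letter."""
    cleaned = text.replace(NOOP, "")  # Y assigns no meaning, so it drops them
    return " ".join(MORSE[ch] for ch in cleaned.upper())

signal = to_morse("S\u000fO\u000fS")
```

Y never needs to know what W meant by the markers; it only needs to know they are discardable.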

 

But now we have Process Z. Process Z wants to take a string and mark every 
instance of five contiguous Latin consonants. It scrapes the database looking 
for text strings. It finds the string Process W created and marked. Z has no 
obligation to W. It’s not part of that private context. Process Z clears all 
instances of U+000F it finds, then inserts its own wherever it finds 
five-consonant clusters. It stores its results in a UTF-16LE text file. It’s 
allowed to do that.
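
A sketch of Process Z under stated assumptions: Python, "consonant" meaning any ASCII letter other than a, e, i, o, u (so y counts), the marker placed immediately before each cluster, and a run longer than five matched only in non-overlapping groups of five. The names are hypothetical.

```python
import re

NOOP = "\u000f"  # the hypothetical discardable character from this example

# Five consecutive ASCII consonants, case-insensitive (y treated as a consonant).
CLUSTER = re.compile(r"[b-df-hj-np-tv-z]{5}", re.IGNORECASE)

def remark_clusters(text: str) -> str:
    """Clear any earlier NOOPs, then insert Z's own before each cluster."""
    cleaned = text.replace(NOOP, "")  # W's markers mean nothing to Z
    return CLUSTER.sub(lambda m: NOOP + m.group(0), cleaned)

remarked = remark_clusters("str\u000fengths")

# U+000F is the two bytes 0F 00 in UTF-16LE.
as_file_bytes = remarked.encode("utf-16-le")
```

W's marker inside "strengths" is gone from the result, and Z's marker sits before "ngths" instead.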

 

Nothing impossible happened here. Let’s summarize:

 

Processes W and X established a private meaning for U+000F by agreement and 
interacted based on that meaning.

 

Process Y ignored U+000F completely because it assigned no meaning to it.

 

Process Z assigned a completely new meaning to U+000F. That’s permitted because 
U+000F is special and is guaranteed to have no semantics without private 
agreement and doesn’t need to be preserved.

 

There is no need to escape anything. Escaping is used when a character must 
have more than one meaning (i.e. it is overloaded, as when it is both text and 
markup). U+000F only gets one meaning in any context. In a new context, the 
meaning gets overridden, not overloaded. That’s what makes it special.

 

I don’t expect to see any of this in official Unicode. But I take exception to 
the idea that I’m suggesting something impossible.

 

 

From: Philippe Verdy [mailto:verd...@wanadoo.fr] 
Sent: Wednesday, July 03, 2019 04:49
To: Sławomir Osipiuk
Cc: unicode Unicode Discussion
Subject: Re: Unicode "no-op" Character?

 

Your goal is **impossible** to reach with Unicode. Assume such a character is 
"added" to the UCS; then it can appear in the text. Your goal being that it 
should be "warrantied" not to be used in any text means that your "character" 
cannot be encoded at all.
