Um... How could you be sure that process X would get the no-ops that process W wrote?  After all, they're *discardable*, like you said, and the database programs and libraries aren't in on the secret.  The database API functions might well strip them out, because they carry no meaning to them. Unless you can count on _certain_ programs not discarding them, and then you'd need either specialty libraries or some kind of registry or terminology for "this program does NOT strip no-ops" vs ones that do... But then they wouldn't be discardable, would they?  Not by non-discarding programs.  Which would have to have ways to pass them around between themselves.

Moreover, as you say, what about when Process Z (or its companions) comes along and is using THE SAME MECHANISM for something utterly different?  How does it know that process W wasn't writing no-ops for it, but was writing them for Process X?  And of course, Z will trash them and insert its own there, and when process X comes to read it, they won't be there. You'd need to make sure that NOBODY is allowed to touch the string between *pairs* of generators and consumers of no-ops, specifically designated for each other.

Yes, this is about consensual acts between responsible processes W and X, but that's exactly what the PUA is for: being assigned meaning between consenting processes. And they are not discardable by non-consenting processes, precisely because they mean something to someone.  If your no-ops carry meaning, they are going to need to be preserved and passed around and not thrown away.  If they carry no meaning, why are you dealing with them?  Yes, PUA characters are annoying and break up grapheme clusters and stuff.  But they're the only way to do what you're trying to do.
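
To make the contrast concrete, here is a minimal Python sketch. It assumes a hypothetical intermediary (`strip_controls`, not any real database API) that drops C0/C1 control characters as meaningless, and it picks U+E000 as an arbitrary private-use marker. None of these names come from the thread; they are illustrative only.

```python
import unicodedata

NOOP = "\u000f"      # the proposed "no-op" character from this thread
PUA_MARK = "\ue000"  # an arbitrary private-use code point (assumption)


def strip_controls(text: str) -> str:
    """Hypothetical third-party cleanup step: drop control characters
    (general category Cc) except tab, newline and carriage return."""
    return "".join(
        ch for ch in text
        if unicodedata.category(ch) != "Cc" or ch in "\t\n\r"
    )


with_noop = "pizza" + NOOP + "topping"
with_pua = "pizza" + PUA_MARK + "topping"

# The no-op marker is silently lost; the PUA marker (category Co) survives.
print(NOOP in strip_controls(with_noop))     # False
print(PUA_MARK in strip_controls(with_pua))  # True
```

Whether a given library actually behaves like `strip_controls` is exactly the thing W and X cannot know in advance; the sketch only shows why a character that anyone may discard cannot be relied on to arrive.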

~mark

On 7/3/19 11:44 AM, Sławomir Osipiuk via Unicode wrote:

A process, let’s call it Process W, adds a bunch of U+000F to a string it received, or built, or a user entered via keyboard. Maybe it’s to packetize. Maybe to mark every word that is an anagram of the name of a famous 19th-century painter, or that represents a pizza topping. Maybe something else. This is a versatile character. Process W is done adding U+000F to the string. It stores it in a UTF-8 encoded database field. Encoding isn’t a problem. The database is happy.

Now Process X runs. Process X is meant to work with Process W and it’s well-aware of how U+000F is used. It reads the string from the database. It sees U+000F and interprets it. It chops the string into packets, or does a websearch for each famous painter, or it orders pizza. The private meaning of U+000F is known to both Process X and Process W. There is useful information encoded in-band, within a limited private context.

But now we have Process Y. Process Y doesn’t care about packets or painters or pizza. Process Y runs outside of the private context that X and W had. Process Y translates strings into Morse code for transmission. As part of that, it replaces common words with abbreviations. Process Y doesn’t interpret U+000F. Why would it? It has no semantic value to Process Y.

Process Y reads the string from the database. Internally, it clears all instances of U+000F from the string. They’re just taking up space. They’re meaningless to Y. It compiles the Morse code sequence into an audio file.

But now we have Process Z. Process Z wants to take a string and mark every instance of five contiguous Latin consonants. It scrapes the database looking for text strings. It finds the string Process W created and marked. Z has no obligation to W. It’s not part of that private context. Process Z clears all instances of U+000F it finds, then inserts its own wherever it finds five-consonant clusters. It stores its results in a UTF-16LE text file. It’s allowed to do that.

Nothing impossible happened here. Let’s summarize:

Processes W and X established a private meaning for U+000F by agreement and interacted based on that meaning.

Process Y ignored U+000F completely because it assigned no meaning to it.

Process Z assigned a completely new meaning to U+000F. That’s permitted because U+000F is special and is guaranteed to have no semantics without private agreement and doesn’t need to be preserved.
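
The scenario summarized above can be sketched in Python. This is only an illustration under assumptions: the function names, the fixed-size "packet" scheme for W, and the consonant pattern for Z are all made up for the example and are not part of any proposal or library.

```python
import re

NOOP = "\u000f"  # the character under discussion


def process_w(text: str, size: int = 3) -> str:
    """W (hypothetical): mark packet boundaries by inserting NOOP
    between fixed-size chunks."""
    packets = [text[i:i + size] for i in range(0, len(text), size)]
    return NOOP.join(packets)


def process_x(stored: str) -> list[str]:
    """X: shares W's private meaning and splits the string into packets."""
    return stored.split(NOOP)


def process_y(stored: str) -> str:
    """Y: assigns no meaning to NOOP and simply drops it."""
    return stored.replace(NOOP, "")


# Assumed definition of "Latin consonant": ASCII letters other than a, e, i, o, u.
FIVE_CONSONANTS = re.compile(r"[b-df-hj-np-tv-z]{5}", re.IGNORECASE)


def process_z(stored: str) -> str:
    """Z: clears existing NOOPs, then inserts its own before each
    run of five contiguous consonants."""
    clean = stored.replace(NOOP, "")
    return FIVE_CONSONANTS.sub(lambda m: NOOP + m.group(0), clean)


stored = process_w("lengths")           # 'len\x0fgth\x0fs'
assert stored.encode("utf-8")           # a single byte 0x0F per marker; encoding is no problem
print(process_x(stored))                # ['len', 'gth', 's']
print(process_y(stored))                # 'lengths'
print(process_x(process_z(stored)))     # ['le', 'ngths']
```

The last line shows what happens if X ever reads Z's output instead of W's: it still finds U+000F and still splits on it, but the boundaries now mean something else entirely.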

There is no need to escape anything. Escaping is used when a character must have more than one meaning (i.e. it is overloaded, as when it is both text and markup). U+000F only gets one meaning in any context. In a new context, the meaning gets overridden, not overloaded. That’s what makes it special.

I don’t expect to see any of this in official Unicode. But I take exception to the idea that I’m suggesting something impossible.

*From:*Philippe Verdy [mailto:verd...@wanadoo.fr]
*Sent:* Wednesday, July 03, 2019 04:49
*To:* Sławomir Osipiuk
*Cc:* unicode Unicode Discussion
*Subject:* Re: Unicode "no-op" Character?

Your goal is **impossible** to reach with Unicode. Assume such a character is "added" to the UCS; then it can appear in text. Since your goal is that it be "warrantied" not to be used in any text, your "character" cannot be encoded at all.