I think you're overstating my concern :)

I meant that those things tend to be particular to a certain context and often 
aren't interesting for interchange.  A text editor might find it convenient to 
place word boundaries in the middle of something that another part of the 
system treats as a single unit to be rendered.  At the same time, a rendering 
engine might notice an "ff" pair and want to mark it to be shown as a ligature, 
though the text editor wouldn't be keen on that at all.

As has been said, these are private mechanisms for things that individual 
processes find interesting.  It's not useful to mark those for interchange, as 
the text editor's word-breaking marks would interfere with the graphics 
engine's glyph-breaking marks.  Not to mention the transmission buffer-size 
marks originally mentioned, which could be anywhere.

The "right" thing to do here is to use an internal higher level mechanism to 
keep track of these things however the component needs.  That can even be 
interchanged with another component designed to the same principles, via 
mechanisms like the PUA.  However, those components can't expect their private 
mechanisms are useful or harmless to other processes.  
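
For what it's worth, here's a rough sketch (Python, with names I just made up 
for illustration) of the kind of out-of-band bookkeeping I mean - the 
annotations live beside the text as offset ranges, so the text that actually 
gets interchanged stays plain:

    # Rough sketch only: keep private annotations *outside* the text,
    # keyed by codepoint offsets, instead of injecting marker codepoints.
    from dataclasses import dataclass, field

    @dataclass
    class Annotation:
        start: int   # offset of the first codepoint in the span
        end: int     # offset one past the last codepoint
        kind: str    # e.g. "word", "ligature", "buffer-mark"

    @dataclass
    class AnnotatedText:
        # plain text, safe to interchange as-is
        text: str
        annotations: list[Annotation] = field(default_factory=list)

    doc = AnnotatedText("efficient")
    doc.annotations.append(Annotation(1, 3, "ligature"))  # renderer's note about "ff"
    doc.annotations.append(Annotation(0, 9, "word"))      # editor's word boundary

    # Only doc.text ever goes out the door; the marks never touch the text stream.
    print(doc.text)   # -> efficient

Each component can attach, ignore, or strip whatever annotations it cares 
about, and none of that leaks into the text handed to the next process.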

What makes this even more complicated is that, as others have pointed out, it's 
pretty much impossible to say "these n codepoints should be ignored and have no 
meaning," because some process would try to use codepoints 1-3 for some private 
meaning.  Another would use codepoint 1 for its own thing, and there'd be a 
conflict.

As a thought experiment, I think it's certainly reasonable to ask the question 
"could such a mechanism be useful?"  It's an intriguing thought and a decent 
hypothesis that this kind of system could be privately useful to an 
application.  I also think that the conversation has pretty much proven that 
such a system is mathematically impossible.  (You can't have a "private" 
no-meaning codepoint that won't conflict with other "private" uses in a public 
space.)

It might be worth noting that this kind of thing used to be fairly common in 
early computing.  Word processors would inject a "CTRL-I" token to toggle 
italics on or off.  Old printers used escape sequences to mark the start of 
bold or italic or underlined or whatever formatting.  Those were private and 
pseudo-private mechanisms that were used internally and/or documented for 
others that wanted to interoperate with those systems.  (The printer folks 
would tell the word processor folks how to make italics happen, then other 
printer folks would use the same or similar mechanisms for compatibility - 
except for the dude that didn't get the memo and made up their own scheme.)

Unicode was explicitly intended *not* to encode any of that kind of markup and, 
instead, to be "plain text," leaving other interesting metadata to higher-level 
protocols - whether that's word breaking, sentence parsing, formatting, buffer 
sizing, or whatever.

-Shawn

-----Original Message-----
From: Unicode <[email protected]> On Behalf Of Richard Wordingham via 
Unicode
Sent: Wednesday, July 3, 2019 4:20 PM
To: [email protected]
Subject: Re: Unicode "no-op" Character?

On Wed, 3 Jul 2019 17:51:29 -0400
"Mark E. Shoulson via Unicode" <[email protected]> wrote:

> I think the idea being considered at the outset was not so complex as 
> these (and indeed, the point of the character was to avoid making 
> these kinds of decisions).

Shawn Steele appeared to be claiming that there was no good, interesting reason 
for separating base character and combining mark.  I was refuting that notion.  
Natural text boundaries can get very messy - some languages have word 
boundaries that can be *within* an indecomposable combining mark.

Richard.
