Re: [Scheme-reports] DISCUSSION/VOTE: The character tower

Bear Tue, 06 May 2014 10:35:40 -0700

On Tue, 2014-05-06 at 02:22 -0400, John Cowan wrote:
> Bear scripsit:
> 
> > Yes, with the exception of code points which are not actually mapped to
> > any character by the Unicode standard. 
> 
> For clarification, which of these do you mean?
> 
> (a) Code points which will never correspond to any character, namely
> the surrogates?  (These are already excluded by -small.)
> 
> (b) Code points for reserved noncharacters (there are 66 of these;
> they are not to be used in interchange, but may be useful internally to
> a program)?
> 
> (c) Code points that will (or at least may) be assigned to characters in
> future versions of Unicode?



I was referring to (a) and (b), plus things like the 'tag
characters', which are irrevocably mapped in the standard but
which the current standard says not to use because that was
a design mistake.  I am ambivalent about (c): on the one hand,
I don't want nonsense points to be a possibility in strings,
but on the other, people may at some point be handling
Unicode characters defined by a standard newer than the
implementation, so there is at least a case for requiring
implementations to handle them.
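
To make the three categories concrete, here is a rough Python sketch
of the classification I have in mind.  The function name `classify`
and the category labels are my own invention for illustration, and
the "unassigned" answer depends on whichever Unicode version the
Python build ships with, so treat it as a sketch rather than a
definition.  The "tag" label is by range only and says nothing about
what any particular Unicode version recommends for those characters.

```python
import unicodedata

def classify(cp: int) -> str:
    """Place a code point into one of the categories discussed above."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"        # (a): never corresponds to a character
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"     # (b): U+FDD0..U+FDEF plus the last two of each plane
    if cp == 0xE0001 or 0xE0020 <= cp <= 0xE007F:
        return "tag"              # language tag and tag characters in plane 14
    if unicodedata.category(chr(cp)) == "Cn":
        return "unassigned"       # (c): may be assigned by a later Unicode version
    return "assigned"             # everything else, including private use

for cp in (0x41, 0xD800, 0xFFFE, 0xE0041):
    print(f"U+{cp:04X}: {classify(cp)}")
```

An implementation that wanted to exclude (a) and (b) but tolerate (c)
would reject only the first two labels.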


> > > 7) Should R7RS-large implementations be required to
> > > provide the characters from #\x10000 to #\x10FFFF?  
> > 
> > No.
> 
> I'm curious why you reject these, seemingly out of hand.  They are
> required by a lot of scripts, though mostly archaic and minority-use ones.
> You similarly reject #11 without explanation.

Over 90% of these code points are nonsense, not mapped to
any character, and I have not yet encountered any need
for the few that remain.  I fear that if programmers are
guaranteed that many nonsense code points, plus code points
they're not personally using, they're going to start
abusing them for some semantics-breaking purpose like
encoding floats, or otherwise treating strings as
blobs.
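
Anyone who wants to check the proportion can do so with a few lines
of Python.  This is only a sketch: the helper name
`survey_supplementary` and the three status labels are mine, and the
counts depend on the Unicode database bundled with the Python build,
which will generally be newer than the standard current as I write.
Whether private-use code points count as "mapped to a character" is a
judgment call, so they are tallied separately.

```python
import unicodedata
from collections import Counter

def survey_supplementary() -> Counter:
    """Count code points U+10000..U+10FFFF by coarse assignment status."""
    counts = Counter()
    for cp in range(0x10000, 0x110000):
        cat = unicodedata.category(chr(cp))
        if cat == "Cn":
            counts["unassigned or noncharacter"] += 1
        elif cat == "Co":
            counts["private use"] += 1
        else:
            counts["assigned"] += 1
    return counts

counts = survey_supplementary()
total = sum(counts.values())
for status, n in counts.most_common():
    print(f"{status}: {n} ({n / total:.1%})")
```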

That said, implementers should definitely be *allowed* to
support these characters, and I assume that most will. 

> > > 8) Should R7RS-large implementations be required to allow #\x0 in strings?
> > 
> > Abstention.  If an implementation is serious enough about Unicode
> > support to keep its strings in a Unicode normalized form, which ought
> > not be forbidden, then NUL can never appear in any string. 
> 
> I don't understand this remark at all.  The normalized form of the U+0000
> character under any normalization form is quite simply itself.  The
> internal encoding of the characters with or without 0 bytes is not
> relevant here.

Is there ever a reason for a string to contain NUL?  NUL has no
semantics.  It is a nonsense point.  No concatenation,
substring, insertion, case operation, etc., on linguistically
meaningful strings in normalized form can produce a normalized
string with a NUL in it.  NUL is a concession to using strings as
blobs; now that we actually have blobs, we don't need it in strings.
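
For reference, a small Python sketch of how NUL behaves under
normalization, using the standard `unicodedata` module.  The helper
names `nul_is_stable` and `contains_nul` are made up for this
example.  It only shows that normalization by itself neither adds nor
removes U+0000, so an implementation that wanted NUL-free strings
would need an explicit check of its own.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def nul_is_stable() -> bool:
    """True if every normalization form maps U+0000 to itself."""
    return all(unicodedata.normalize(form, "\x00") == "\x00" for form in FORMS)

def contains_nul(s: str) -> bool:
    """The explicit check an implementation could apply to reject NUL."""
    return "\x00" in s

print(nul_is_stable())
print(contains_nul(unicodedata.normalize("NFC", "a\x00b")))
```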

> > Yes, with the exception of code points which are not actually mapped to 
> > any character by the Unicode standard and code points which have a
> > canonical decomposition (i.e., the standard ought to allow an
> > implementation to implement strings as Unicode normalized strings). 
> 
> That is, in normalization form D, I assume you mean.  (Normalization form
> C is more commonly used, and actually encourages the use of characters
> with a canonical decomposition.)

If we presume that the standard should leave the choice of internal
normalization form to the implementations, then there are many
characters which the standard cannot require an implementation to
allow in strings: those with a canonical or compatibility
decomposition, and those which are for whatever reason nonsense
points.  There are many characters (such as singletons like the
angstrom sign, and the composition exclusions) which have
canonical decompositions but which are not themselves the
result of canonical composition; these cannot appear even in NFC,
and the standard therefore cannot require implementations to
allow them in strings.
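
Which characters a given form rules out is easy to probe.  The
following Python sketch uses `unicodedata.normalize`; the helper
`survives` and the choice of sample characters are mine.  A character
"survives" a form if normalizing it alone gives it back unchanged,
i.e. it could appear in a string kept in that form.  In this check
the angstrom sign and Devanagari QA survive no form at all, while
the fi ligature is unchanged by NFC and NFD and is removed only by
the compatibility forms.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def survives(ch: str, form: str) -> bool:
    """True if ch is left unchanged by the given normalization form."""
    return unicodedata.normalize(form, ch) == ch

samples = {
    "U+00E9 e with acute": "\u00e9",    # precomposed; canonical decomposition
    "U+212B angstrom sign": "\u212b",   # singleton canonical decomposition
    "U+0958 Devanagari QA": "\u0958",   # composition exclusion
    "U+FB01 fi ligature": "\ufb01",     # compatibility decomposition only
}

for name, ch in samples.items():
    kept = [form for form in FORMS if survives(ch, form)]
    print(f"{name}: survives {kept if kept else 'no form'}")
```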

That was what I had in mind, but on the other hand there may 
actually be a good reason for the standard to pick an internal 
string normalization form for all implementations.  If there 
really is a good reason, and it doesn't create an incompatibility 
with R7RS-small, then the standard *should* pick a normalization 
form. 

Bear



