On Tue, 2014-05-06 at 14:45 -0400, John Cowan wrote:
> Bear scripsit:

> > If we presume that the standard should leave a choice of internal 
> > normalization form to the implementations, 
> 
> <chair hat="off">
> Now we get into the tight and the bad and the crazy.  Do you really
> think that it should be up to the implementation to choose whether
> (string-length (string #\A #\x0301)) returns 1 or 2?  I submit
> that for the (scheme base) versions of these procedures the result
> must be 2.  If you want auto-normalizing, you need to create a
> parallel string library that provides it.
> </chair>
> 
> (#\x0301 is COMBINING ACUTE ACCENT, which following A makes it Á.)

That ship has sailed, I fear.  When R6 and R7-small were being 
discussed, I advocated a model of characters as a Unicode base 
character plus a nondefective combining sequence.  That model 
would solve this problem and unambiguously make this string's 
length 1, regardless of which normalization form were used to 
represent the value in memory.  It would also make indexing and 
locations of characters in strings unambiguous, mostly eliminate 
length changes under case operations, preserve the 
string-as-sequence-of-characters semantics we had, and yield a 
character semantics cleaner and less ambiguous than Unicode's, 
one capable of being mapped cleanly onto other character 
repertoires or representations -- all of which I thought of as 
good things.

But that model had no widespread implementation support.  
Further, it would have made the char library you referred to in 
the last question impossible to implement, because in that model 
there is a literally infinite number of possible characters, and 
any attempt to iterate over them all could never terminate.  So 
it was rejected.

That rejection means we now have a model in which a character is 
a Unicode codepoint.  Because there are multiple ways to express 
a given string as a sequence of Unicode codepoints, the identity 
of a string is now unglued from the identities of the characters 
from which it was built, and we no longer have clean 
string-as-sequence-of-characters semantics.  Questions such as 

(string-length (string #\A #\x0301)) 

are therefore now irrevocably dependent on Unicode, and 
specifically on Unicode normalization form.  
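For anyone who wants to see the ambiguity concretely outside of 
Scheme, the same question is easy to pose in any language whose 
string length counts codepoints.  Here is a sketch in Python 
(chosen only because its standard unicodedata module exposes the 
normalization forms directly; the Scheme expressions above are 
the real subject):

```python
import unicodedata

# "A" followed by U+0301 COMBINING ACUTE ACCENT:
# two codepoints, as in a decomposed (NFD-style) representation.
decomposed = "A\u0301"
print(len(decomposed))          # 2

# NFC composes the pair into the single codepoint U+00C1 (Á).
composed = unicodedata.normalize("NFC", decomposed)
print(len(composed))            # 1

# The two strings render identically but are different
# codepoint sequences, so a codepoint-counting string-length
# depends entirely on which normalization form is in memory.
print(composed == decomposed)   # False
```

Under a codepoint-is-a-character model, whether the analogous 
(string-length ...) answer is 1 or 2 is exactly this choice of 
in-memory normalization form.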

So we now have this choice: either the standard dictates what 
normalization form to use in strings, or string-length and 
friends have implementation-defined semantics.

                        Bear

_______________________________________________
Scheme-reports mailing list
[email protected]
http://lists.scheme-reports.org/cgi-bin/mailman/listinfo/scheme-reports
