Re: "Unicode in 'NFG' formation" ?

John M. Dlugosz Mon, 18 May 2009 17:41:45 -0700

Mark J. Reed markjreed-at-gmail.com |Perl 6| wrote:

On Mon, May 18, 2009 at 9:11 AM, Austin Hastings
<austin_hasti...@yahoo.com> wrote:

If you haven't read the PDD, it's a good start.


<snip useful summary>

I get all that, really.  I still question the necessity of mapping
each grapheme to a single integer.  A single *value*, sure.
length($weird_grapheme) should always be 1, absolutely.  But why does
ord($weird_grapheme) have to be a *numeric* value?  If you convert to,
say, normalization form C and return a list of the scalar values so
obtained, that can be used in any context to reproduce the same
grapheme, with no worries about different processes coming up with
different assignments of arbitrary negative numbers to graphemes.

My feelings, in general. It appears that the concept of mapping total
graphemes to integers, negative, etc. is an implementation decision.
Perl 6 strings have a concept of graphemes, and functions that work with
them. But the core language specification should keep that as general
as possible, and allow implementation freedom. The statement that "base
moda modb" produces the same grapheme value as "base modb moda" is at
the correct level. The statement "the grapheme is an Int" is not only
at the wrong level, but not right, as they should be their own distinct
type. I think that the PDD details of assigning negative values as
encountered AND the idea of being a list of code points in some
normalized form, AND the idea of it being a buffer of bytes in UTF8 with
that list of code points encoded therein, are all *allowed* as correct
implementations. So is having a type whose instance data stores it in
however many forms it wants, and for the Perl end of things you just let
the === operator take its natural course.
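To make the "own distinct type" idea concrete, here is a minimal sketch
in Python rather than Perl 6. It is only one of the allowed
representations (a tuple of NFC code points), and the `Grapheme` class
and `from_text` name are hypothetical, not taken from the PDD or any
synopsis. It does not check that the input really is a single grapheme.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class Grapheme:
    """Hypothetical grapheme value: compared by content, not by an Int."""

    # Assumption: the instance stores the NFC code points of the cluster.
    code_points: tuple

    @classmethod
    def from_text(cls, text):
        # Normalizing first puts combining marks into canonical order.
        nfc = unicodedata.normalize("NFC", text)
        return cls(tuple(ord(ch) for ch in nfc))

    def __str__(self):
        return "".join(chr(cp) for cp in self.code_points)


# "base moda modb" versus "base modb moda":
# q + COMBINING DOT ABOVE + COMBINING DOT BELOW, and the reverse order.
a = Grapheme.from_text("q\u0307\u0323")
b = Grapheme.from_text("q\u0323\u0307")
same = (a == b)
```

Under these assumptions `a == b` holds without either value ever being
assigned a number, which is the level of guarantee argued for above.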

        If you're doing arithmetic with the code points or scalar values of
        characters, then the specific numbers would seem to matter.  I'm
        looking for the use case where the fact that it's an integer matters
        but the specific value doesn't.

Well, you can view a string as bytes of UTF8, code points, or
graphemes. If you want numbers you probably wanted the first two. A
grapheme object should in some ways behave as a string of 1 grapheme and
allow you to obtain bytes of UTF8 or code points, easily. Now object
identity, the "address" of an object, is not mandated to be an Int or
even numeric. Different types can return different things even. The
only thing we know is that infix:<===> uses them.
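The three views of a string mentioned above can be illustrated with a
short Python sketch. The `graphemes` helper is hypothetical and
deliberately simplified: it only attaches combining marks to the
preceding character and is not a full implementation of Unicode
grapheme cluster segmentation.

```python
import unicodedata


def graphemes(text):
    """Simplified clustering: a base character plus following combining marks."""
    clusters = []
    for ch in text:
        # unicodedata.combining() is nonzero for combining marks.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


s = "q\u0307\u0323x"  # q + two combining dots, then a plain x

as_bytes = s.encode("utf-8")             # view 1: bytes of UTF-8
as_code_points = [ord(ch) for ch in s]   # view 2: code points
as_graphemes = graphemes(s)              # view 3: graphemes

# Each grapheme can still hand back the two numeric views on request.
first = as_graphemes[0]
first_code_points = [ord(ch) for ch in first]
first_bytes = first.encode("utf-8")
```

The numbers live in the first two views; the grapheme view only needs
to be able to produce them, not to be one.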

Should graphemes be any different? A grapheme object has observed
behavior ("encode it as...") and internal unobserved behavior. Perhaps
we need more assertions such as saying that it can serve as hash keys
properly, rather than going all the way to saying that they must be
numbered. Especially with an internal numbering system that changes
from run to run!
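As a rough illustration of "serves as a hash key properly" without any
numbering, the Python sketch below keys a dictionary on the normalized
text itself. The `grapheme_key` name is hypothetical, and using NFC as
the key is an assumption made for the example, not something the
specification mandates.

```python
import unicodedata


def grapheme_key(text):
    # The key is derived from the content alone, so it is identical in
    # every process and every run; nothing is assigned on first sight.
    return unicodedata.normalize("NFC", text)


counts = {}
for spelling in ("q\u0307\u0323", "q\u0323\u0307"):
    key = grapheme_key(spelling)
    counts[key] = counts.get(key, 0) + 1
```

Both spellings land in the same dictionary entry, which is the
observable behavior that matters to the user of the hash.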

Meanwhile... that's what the Str class does. It still has nothing to do
with how source code is parsed. To that extent, mentioning it in S02,
at least in that section, is a mistake. A see-also to general Perl
Unicode documentation would not be objectionable.

Also, I described more detailed, formal handling of the input stream to
the Perl 6 parser last year: <http://www.dlugosz.com/Perl6/specdoc.pdf>
in Section 3.1. It was discussed on this mailing list when I was
starting it.


--John
