Mark J. Reed markjreed-at-gmail.com |Perl 6| wrote:
On Mon, May 18, 2009 at 9:11 AM, Austin Hastings
<austin_hasti...@yahoo.com> wrote:
If you haven't read the PDD, it's a good start.

<snip useful summary>

I get all that, really.  I still question the necessity of mapping
each grapheme to a single integer.  A single *value*, sure.
length($weird_grapheme) should always be 1, absolutely.  But why does
ord($weird_grapheme) have to be a *numeric* value?  If you convert to,
say, normalization form C and return a list of the scalar values so
obtained, that can be used in any context to reproduce the same
grapheme, with no worries about different processes coming up with
different assignments of arbitrary negative numbers to graphemes.

My feelings, in general. Mapping whole graphemes to integers (negative or otherwise) appears to be an implementation decision. Perl 6 strings have a concept of graphemes, and functions that work with them, but the core language specification should keep that concept as general as possible and allow implementation freedom. The statement that "base moda modb" produces the same grapheme value as "base modb moda" is at the correct level. The statement "the grapheme is an Int" is not only at the wrong level but also wrong, since graphemes should be their own distinct type.

I think that the PDD scheme of assigning negative values as graphemes are encountered, AND the idea of a grapheme being a list of code points in some normalized form, AND the idea of it being a buffer of UTF-8 bytes encoding that list of code points, are all *allowed* as correct implementations. So is a type whose instance data stores the grapheme in however many forms it wants; on the Perl end of things you just let the === operator take its natural course.
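To make the "base moda modb" point concrete, here is a minimal sketch in Python rather than Perl 6, using only the standard unicodedata module. The helper name grapheme_key is made up for illustration, and the choice of NFC is an assumption taken from the suggestion quoted above, not something the spec mandates. The order of the modifiers stops mattering only when they have different canonical combining classes, which is the case for the two marks chosen here.

```python
import unicodedata

def grapheme_key(text):
    """Return the NFC scalar values of `text` as a tuple.

    Hypothetical helper: one possible process-independent value for a
    grapheme, with no arbitrary negative numbers involved.
    """
    return tuple(ord(ch) for ch in unicodedata.normalize("NFC", text))

# "base moda modb" versus "base modb moda":
# U+0323 COMBINING DOT BELOW has combining class 220,
# U+0307 COMBINING DOT ABOVE has combining class 230.
a = "q\u0323\u0307"
b = "q\u0307\u0323"

print(a == b)                              # the raw code point sequences differ
print(grapheme_key(a) == grapheme_key(b))  # the normalized values agree
```

Any process that runs the same normalization arrives at the same tuple, which is the property the negative-integer scheme does not give you.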

        If you're doing arithmetic with the code points or scalar values of
        characters, then the specific numbers would seem to matter.  I'm
        looking for the use case where the fact that it's an integer matters
        but the specific value doesn't.


Well, you can view a string as bytes of UTF-8, code points, or graphemes. If you want numbers you probably wanted the first two. A grapheme object should in some ways behave as a string of 1 grapheme and allow you to obtain bytes of UTF-8 or code points, easily. Now object identity, the "address" of an object, is not mandated to be an Int or even numeric. Different types can even return different things. The only thing we know is that infix:<===> uses these identity values.

Should graphemes be any different? A grapheme object has observed behavior ("encode it as...") and internal unobserved behavior. Perhaps we need more assertions, such as saying that graphemes can properly serve as hash keys, rather than going all the way to saying that they must be numbered. Especially with an internal numbering system that changes from run to run!
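As a rough sketch of what "observed behavior plus hash-key suitability" could look like, again in Python rather than Perl 6: the Grapheme class below is entirely hypothetical, stores NFC text as its private representation (an assumption, one of several allowed choices), and does not check that its argument really is a single grapheme. Nothing in it exposes a number for the grapheme as a whole.

```python
import unicodedata

class Grapheme:
    """Hypothetical value type for one grapheme (illustration only).

    The internal form is NFC text; callers see only the encodings
    and the equality/hash behavior, never an integer id.
    """
    __slots__ = ("_nfc",)

    def __init__(self, text):
        # Assumption: `text` holds exactly one grapheme; not validated here.
        self._nfc = unicodedata.normalize("NFC", text)

    def codepoints(self):
        """Observed behavior: the grapheme as a list of code points."""
        return [ord(ch) for ch in self._nfc]

    def utf8(self):
        """Observed behavior: the grapheme as UTF-8 bytes."""
        return self._nfc.encode("utf-8")

    def __eq__(self, other):
        if not isinstance(other, Grapheme):
            return NotImplemented
        return self._nfc == other._nfc

    def __hash__(self):
        # Consistent with __eq__, so instances work as dict keys.
        return hash(self._nfc)

# Two spellings of the same grapheme land on the same hash key.
counts = {}
for text in ("q\u0323\u0307", "q\u0307\u0323"):
    g = Grapheme(text)
    counts[g] = counts.get(g, 0) + 1

print(len(counts))                       # number of distinct keys
print(Grapheme("e\u0301").codepoints())  # code points after normalization
```

The equality and hash methods play the role that === plays on the Perl side: they are defined on the value, and whatever the implementation stores underneath stays unobserved.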

Meanwhile... that's what the Str class does. It still has nothing to do with how source code is parsed. To that extent, mentioning it in S02, at least in that section, is a mistake. A see-also to general Perl Unicode documentation would not be objectionable.

Also, I described more detailed, formal handling of the input stream to the Perl 6 parser last year: <http://www.dlugosz.com/Perl6/specdoc.pdf> in Section 3.1. It was discussed on this mailing list when I was starting it.

--John
