Mark J. Reed markjreed-at-gmail.com |Perl 6| wrote:
> On Mon, May 18, 2009 at 9:11 AM, Austin Hastings
> <austin_hasti...@yahoo.com> wrote:
>> If you haven't read the PDD, it's a good start.
> <snip useful summary>
> I get all that, really. I still question the necessity of mapping
> each grapheme to a single integer. A single *value*, sure.
> length($weird_grapheme) should always be 1, absolutely. But why does
> ord($weird_grapheme) have to be a *numeric* value? If you convert to,
> say, normalization form C and return a list of the scalar values so
> obtained, that can be used in any context to reproduce the same
> grapheme, with no worries about different processes coming up with
> different assignments of arbitrary negative numbers to graphemes.
Those are my feelings too, in general. It appears that mapping whole
graphemes to integers, negative or otherwise, is an implementation decision.
Perl 6 strings have a concept of graphemes, and functions that work with
them. But the core language specification should keep that as general
as possible, and allow implementation freedom. The statement that "base
moda modb" produces the same grapheme value as "base modb moda" is at
the correct level. The statement "the grapheme is an Int" is not only
at the wrong level, but also not right, as graphemes should be their own
distinct type. I think that the PDD details of assigning negative values as
encountered AND the idea of being a list of code points in some
normalized form, AND the idea of it being a buffer of bytes in UTF8 with
that list of code points encoded therein, are all *allowed* as correct
implementations. So is having a type whose instance data stores it in
however many forms it wants, and for the Perl end of things you just let
the === operator take its natural course.
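To make the "base moda modb" point concrete, here is a minimal sketch of the
list-of-code-points option. It is in Python rather than Perl 6, and it is not
anything the PDD or the Synopses specify. The helper name grapheme_key is made
up for illustration; it normalizes to NFC with the standard unicodedata module
and returns the resulting code points as a tuple, so both orderings of the
same modifiers give the same value with no per-process numbering involved.

```python
import unicodedata

def grapheme_key(text):
    """Return the NFC code points of `text` as a tuple (hypothetical helper)."""
    return tuple(ord(ch) for ch in unicodedata.normalize("NFC", text))

# "a" followed by two combining marks from different combining classes,
# written in both possible orders.
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW
DOT_ABOVE = "\u0307"  # COMBINING DOT ABOVE
one_order = "a" + DOT_BELOW + DOT_ABOVE
other_order = "a" + DOT_ABOVE + DOT_BELOW

# The raw strings differ, but the normalized keys compare equal.
print(one_order == other_order)
print(grapheme_key(one_order) == grapheme_key(other_order))
```

Any process running the same normalization gets the same tuple, which is the
property the arbitrary negative numbers lack.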
> If you're doing arithmetic with the code points or scalar values of
> characters, then the specific numbers would seem to matter. I'm
> looking for the use case where the fact that it's an integer matters
> but the specific value doesn't.
Well, you can view a string as bytes of UTF8, code points, or
graphemes. If you want numbers you probably wanted the first two. A
grapheme object should in some ways behave as a string of 1 grapheme and
allow you to obtain bytes of UTF8 or code points, easily.
Now object identity, the "address" of an object, is not mandated to be
an Int or even numeric. Different types can even return different
things. The only thing we know is that infix:<===> uses them.
Should graphemes be any different? A grapheme object has observed
behavior ("encode it as...") and internal unobserved behavior. Perhaps
we need more assertions, such as saying that graphemes can properly serve
as hash keys, rather than going all the way to saying that they must be
numbered. Especially with an internal numbering system that changes
from run to run!
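As a rough illustration of "observed behavior, hidden representation", here is
a small Python sketch of such a type. The class name Grapheme and its methods
are hypothetical, not an API from any Perl 6 document, and storing NFC text
internally is only one of the allowed choices discussed above. The sketch does
not check that its input really is a single grapheme. What it asserts is only
that equal graphemes compare equal, hash alike, and can be encoded on request.

```python
import unicodedata

class Grapheme:
    """Hypothetical grapheme value type; not an API from any Perl 6 spec."""

    def __init__(self, text):
        # Internal representation is private; NFC text is just one choice.
        self._nfc = unicodedata.normalize("NFC", text)

    def __eq__(self, other):
        if not isinstance(other, Grapheme):
            return NotImplemented
        return self._nfc == other._nfc

    def __hash__(self):
        # Equal graphemes must hash alike so they work as hash keys.
        return hash(self._nfc)

    def code_points(self):
        """Observed behavior: the grapheme as a list of code points."""
        return [ord(ch) for ch in self._nfc]

    def utf8(self):
        """Observed behavior: the grapheme as UTF-8 bytes."""
        return self._nfc.encode("utf-8")

# Two spellings of the same grapheme land on the same hash key.
counts = {}
for spelling in ("\u00e9", "e\u0301"):
    g = Grapheme(spelling)
    counts[g] = counts.get(g, 0) + 1
print(len(counts))
```

Nothing here exposes a number for the grapheme itself, and nothing changes
from run to run.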
Meanwhile... that's what the Str class does. It still has nothing to do
with how source code is parsed. To that extent, mentioning it in S02,
at least in that section, is a mistake. A see-also to general Perl
Unicode documentation would not be objectionable.
Also, I described more detailed, formal handling of the input stream to
the Perl 6 parser last year: <http://www.dlugosz.com/Perl6/specdoc.pdf>
in Section 3.1. It was discussed on this mailing list when I was
starting it.
--John