Re: A12: Strings

Larry Wall Tue, 20 Apr 2004 22:52:58 -0700

On Tue, Apr 20, 2004 at 02:16:01PM -0400, Aaron Sherman wrote:
: Well, I have a lot to digest, but off the top of my head (and having
: nothing to do with objects, but rather the string discussion at the
: end), it would be very useful if I could assert:
: 
:       no string "complex";
: 
: or something like that. That is to say, I would love to have a way to
: say that my strings are just plain old C-style arrays of 8-bit
: characters.


Yes, that's in the works.  The plan is to have four Unicode support levels.

    Level 0     character = byte
    Level 1     character = codepoint
    Level 2     character = grapheme
    Level 3     character = letter

These would be declared by lexically scoped declarations:

    use bytes 'ISO-8859-1';
    use codepoints;
    use graphemes;
    use letters 'Turkish';

It's possible to get into level 0 with a bare "use bytes", but then
you just get "C" locale semantics.  More often you would specify which
8-bit encoding supplies the default semantics.  It's not possible to get into
level 3 without declaring a specific language.  You can't just say
"use letters".  Possibly there's support for "use letters :locale",
but don't tell Jarkko.  :-)
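
To make the first three levels concrete, here is a minimal Python sketch (not Perl 6, whose syntax for this was still being designed) that counts the "characters" in one string at each level. The function names are hypothetical, and the grapheme count is only an approximation that folds combining marks into the preceding base character; real grapheme segmentation is more involved. Level 3 is left out because it depends on language-specific rules.

```python
import unicodedata

def count_bytes(s, encoding="utf-8"):
    """Level 0: a character is a byte of some chosen encoding."""
    return len(s.encode(encoding))

def count_codepoints(s):
    """Level 1: a Python str is already a sequence of codepoints."""
    return len(s)

def count_graphemes_approx(s):
    """Level 2 (approximation): count only non-combining codepoints,
    so a base letter plus its combining marks counts once."""
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: one visible character.
text = "e\u0301"
print(count_bytes(text), count_codepoints(text), count_graphemes_approx(text))
```

The same string has three different lengths depending on the level, which is why the level has to be declared rather than guessed.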

Note these just warp the defaults.  Underneath is still a strongly
typed string system.  So you can say "use bytes" and know that the
strings that *you* create are byte strings.  However, if you get in a
string from another module, you can't necessarily process it as bytes.
If you haven't specified how such a string is to be processed in
your worldview, you're probably going to get an exception.  You might
anyway, if what you specified is an impossible downconversion.

So yes, you can have "use bytes", but it puts more responsibility on
you rather than less.  You might rather just specify the type of your
particular string or array, and stay with codepoints or graphemes in
the general case.  To the extent that we can preserve the abstraction
that a string is just a sequence of integers, the values of which
have some known relationship to Unicode, it should all just work.
In particular, Latin-1 is by definition the 8-bit subset of Unicode,
so if you stick to those codepoints you're safe.  Functions and
interfaces that require 8-bit bytes will be able to convert such a
string regardless of its internal representation.
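
The Latin-1 point, and the "impossible downconversion" exception mentioned earlier, can be illustrated with a short Python sketch. This is an analogy using Python's own `latin-1` codec, not a description of how Perl 6 or Parrot would implement it; `to_octets` is a hypothetical helper name.

```python
def to_octets(s):
    """Convert a string to 8-bit bytes.

    Works for any string whose codepoints are all <= U+00FF and
    raises UnicodeEncodeError otherwise.
    """
    return s.encode("latin-1")

safe = "caf\u00e9"            # every codepoint is <= U+00FF
octets = to_octets(safe)

# Each byte value equals the corresponding codepoint value.
same_values = list(octets) == [ord(c) for c in safe]

try:
    to_octets("\u0141")       # U+0141 does not fit in 8 bits
    downconversion_failed = False
except UnicodeEncodeError:
    downconversion_failed = True

print(octets, same_values, downconversion_failed)
```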

: I know that at a low level Parrot is still going to have its way with
: these, but at the very least, I want to be able to put the tag in there
: (lexically or otherwise) to make me feel better about myself as a human
: being when I do:
: 
:       my $n = '';
:       for @stuff -> $_ {$n ~= (defined($_)??1::0)}
:       my $stuff_as_bitvec = pack("b*",$n);
:       %state_is_known{$stuff_as_bitvec} = 1;
: 
: It's going to be hard for me to accept that that operation is going to
: have to worry about codepoints... really hard. Especially so if I'm
: doing this in a tight loop as I was recently.

If you never put anything into a string bigger than U+00ff, you're
guaranteed to get semantics indistinguishable from a byte string,
regardless of how the characters might actually be stored.  We aimed
for this ideal in Perl 5 but were never quite able to achieve it in
all the nooks and crannies of the language.  There was just too much
legacy to deal with.  Jarkko took it as far as humanly possible, and
in some cases farther.  But hopefully we can make a clean break from
the looney locale legacy with Perl 6.
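
For comparison, the quoted bit-vector idiom can be sketched in Python, where the packed result is a plain byte string used directly as a dictionary key. This assumes that `pack("b*", ...)` fills each byte starting from its lowest bit, as it does in Perl 5; the helper name `pack_bits_ascending` and the sample data are made up for illustration.

```python
def pack_bits_ascending(flags):
    """Pack booleans into bytes, lowest bit of each byte first."""
    out = bytearray((len(flags) + 7) // 8)
    for i, flag in enumerate(flags):
        if flag:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)

# None stands in for an undefined value; 0 and "x" are both defined.
stuff = [1, None, "x", None, None, None, None, None, 0]
flags = [item is not None for item in stuff]

stuff_as_bitvec = pack_bits_ascending(flags)
state_is_known = {stuff_as_bitvec: 1}
print(stuff_as_bitvec)
```

Every value in the packed key is at most 0xFF, so it stays within the range where byte and codepoint semantics coincide.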

: I suppose if there were a type:
: 
:       my Octets $stuff_as_bitvec = '';
:       ...
: 
: Then that would be a start, but even then what of the hashing operation?
: Will there be some property of a hash I have to set too?
: 
:       class Octets_Num_Pair is Pair {
:               my Octets $.key;
:               my Num $.val;
:               ... redefine key management in terms of Octets ...
:       }
:       my Octets_Num_Pair %state_is_known;

Hashes aren't declared to return pairs, but rather values.  If you need
to change the key type it's a trait on the storage class.  But...

: Is that right, or would there be a key_type property on hashes? More to
: the point, is it worth it, or will I be further slowing down hash access
: because it's special-cased in the default situation?

Hashes should handle various types of built-in key strings properly
by default.  It's only if you want to start hashing on objects that
you have to make sure your class "does" Hashkey or some such.
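
As a loose analogy only: Python draws the same line, where built-in strings and byte strings hash out of the box but a user-defined class has to supply its own hashing and equality before it can serve as a key by value. The `Octets` class below is a hypothetical stand-in, and Python's `__hash__`/`__eq__` protocol is not the "Hashkey" role, whose name and shape were still tentative.

```python
class Octets:
    """Hypothetical wrapper around a byte string, usable as a dict key."""

    def __init__(self, data):
        self.data = bytes(data)

    def __eq__(self, other):
        return isinstance(other, Octets) and self.data == other.data

    def __hash__(self):
        # Delegate to the built-in hash of the underlying bytes.
        return hash(self.data)

state_is_known = {Octets(b"\x05\x01"): 1}

# A distinct but equal object finds the same entry.
found = state_is_known.get(Octets(b"\x05\x01"))
missing = state_is_known.get(Octets(b"\x00"))
print(found, missing)
```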

Larry
