RE: [ID 20020130.001] Unicode broken for 0x10FFFF

Brent Dax Wed, 30 Jan 2002 22:23:20 -0800

Larry Wall:
# For various reasons, some of which relate to the sequence-of-integer
# abstraction, and some of which relate to "infinite" strings and arrays,
# I think Perl 6 strings are likely to be represented by a list of
# chunks, where each chunk is a sequence of integers of the same size or
# representation, but different chunks can have different integer sizes
# or representations.  The abstract string interface must hide this from
# any module that wishes to work at the abstract string level.  In
# particular, it must hide this from the regex engine, which works on
# pure sequences in the abstract.
#
# Note that I did not use the phrase "pure sequences of integers" in the
# last sentence.  The regex engine must not care if it is matching
# characters from a string of known length, or token objects from an
# array that is being grown arbitrarily on demand.  Matching on UTF-32
# is not good enough.
#
# This is just a heads up for some of the stuff in Apocalypse 5.
# Backtracking behavior will not necessarily be limited to regexes in
# Perl 6, and if so, we have to consider very carefully how regex
# backtracking, continuations, and temp variable unifications all work
# together.  (This is part of the reason I pushed earlier for the regex
# opcodes to be meshed with the normal opcodes.)
#
# I seriously intend that it be trivial to write a Perl parser (or any
# other parser) in Perl, and that changing a grammar rule be as simple as
# swapping in a different qr// (or a sub equivalent to a qr//).  More
# generally, I want logic programming to be one of the paradigms that
# Perl supports.  And as usual, I want to support it without forcing it
# on people who aren't interested.


As the regex guy for Parrot, my first response to this sounded something
like "oh, crap".  This'll be hard to make efficient, hard to implement
for all cases, and all that.  But as I thought about it more, I realized
that there's a fairly easy way to do this.

The first thing is to make sure that, at the Parrot level, "$left =~
$right" calls $right->vtable->match, not $left.  The second thing is to
make sure that =~ on characters (or character streams) is the same as
"eq"--character-set-independent comparison.

Once that's done, it's quite easy.
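
The two requirements above can be sketched in Python rather than Parrot.
This is only an illustration: smart_match, Char, and TokenClass are
made-up names, not Parrot's vtable API.  The point is that the RIGHT
operand decides what "matching" means, and that characters compare by
code point regardless of how they are stored.

```python
class Char:
    """A character identified by its code point, not by its encoding."""

    def __init__(self, codepoint):
        self.codepoint = codepoint

    def match(self, left):
        # Character-set-independent comparison: only code points matter.
        return isinstance(left, Char) and left.codepoint == self.codepoint


class TokenClass:
    """Wraps a class so that matching against it is an is-a test."""

    def __init__(self, cls):
        self.cls = cls

    def match(self, left):
        return isinstance(left, self.cls)


def smart_match(left, right):
    # Hypothetical stand-in for "$left =~ $right": dispatch on the right side.
    return right.match(left)
```

With this shape, a regex engine only ever calls smart_match(item, atom)
and never needs to know whether item is a character or a token object.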

A regex becomes a series of =~ operations.  For example, let's say @toke
contains a series of tokens:

        @toke=(... new Perl6::Toke::Term(), new Perl6::Toke::Operator::Plus(),
                new Perl6::Toke::Term() ...);

Now, assume that \t{Foo} in a regex is like $curitem =~
Perl6::Toke::Foo.  (I assume Larry will come up with a more general
mechanism, but you get the idea.)

Finally, assume =~ on classes is an ISA search.

Now, to find the first addition operation in the given token stream, you
just do something like this:

        @toke =~ m<\t{Value}\t{Operator::Plus}\t{Value}>;
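
As a rough Python sketch of what that match would do: each \t{Foo} atom
becomes an isinstance() test (the ISA search), and the engine slides the
pattern along the token list.  The class names and find_first are
hypothetical, and I am assuming Term is a subclass of Value so that the
Term tokens above satisfy \t{Value}.

```python
class Toke: pass
class Value(Toke): pass
class Term(Value): pass        # assumption: a Term is-a Value
class Operator(Toke): pass
class Plus(Operator): pass
class Minus(Operator): pass


def find_first(tokens, pattern):
    """Return the index where the class pattern first matches, or -1."""
    n = len(pattern)
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        if all(isinstance(tok, cls) for tok, cls in zip(window, pattern)):
            return start
    return -1


toke = [Term(), Minus(), Term(), Plus(), Term()]
first_addition = find_first(toke, [Value, Plus, Value])
```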

To find the first unary plus operator:

        @toke =~ m<(?<!\t{Value})\t{Operator::Plus}>;  #or something like that
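
The negative lookbehind translates to "a Plus whose previous token is
not a Value (or which has no previous token at all)".  A minimal Python
sketch, with hypothetical class names and a made-up find_unary_plus
helper:

```python
class Toke: pass
class Value(Toke): pass
class Operator(Toke): pass
class Plus(Operator): pass


def find_unary_plus(tokens):
    """Index of the first Plus not preceded by a Value, or -1."""
    for i, tok in enumerate(tokens):
        if not isinstance(tok, Plus):
            continue
        preceded_by_value = i > 0 and isinstance(tokens[i - 1], Value)
        if not preceded_by_value:   # this is the (?<!\t{Value}) part
            return i
    return -1


# Roughly "a + +b": the second Plus follows an operator, so it is unary.
toke = [Value(), Plus(), Plus(), Value()]
unary_at = find_unary_plus(toke)
```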

To compress all value/addition-precedence-operator/value sequences into
value tokens:

        @toke =~ s<
                \t{Value}
                [
                        \t{Operator::Plus}
                        \t{Operator::Minus}
                        \t{Operator::Underscore}
                ]
                \t{Value}
        ><
                new Perl6::Toke::Value($&)
        >eg;
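
Here is a Python sketch of that substitution.  I am reading the
bracketed group as "any one of these operators", and modelling the g
flag as a single left-to-right pass that does not rescan what it has
already replaced.  compress_once and the class names are hypothetical;
Star is included only to show a token that is left alone.

```python
class Toke: pass
class Operator(Toke): pass
class Plus(Operator): pass
class Minus(Operator): pass
class Underscore(Operator): pass
class Star(Operator): pass


class Value(Toke):
    def __init__(self, children=()):
        self.children = list(children)   # empty for a leaf value


ADDITIVE = (Plus, Minus, Underscore)


def compress_once(tokens, ops=ADDITIVE):
    """Replace each Value/op/Value run with one Value holding the three."""
    out, i = [], 0
    while i < len(tokens):
        window = tokens[i:i + 3]
        if (len(window) == 3
                and isinstance(window[0], Value)
                and isinstance(window[1], ops)
                and isinstance(window[2], Value)):
            out.append(Value(window))    # like new Perl6::Toke::Value($&)
            i += 3
        else:
            out.append(tokens[i])
            i += 1
    return out


toke = [Value(), Plus(), Value(), Star(), Value(), Minus(), Value()]
result = compress_once(toke)
```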

Now, check this one out:

        $unop=qr[(?<!\t{Value})];

        @rules = (...,
                qr<
                        (?:
                                $unop
                                [
                                        \t{Operator::PlusPlus}
                                        \t{Operator::MinusMinus}
                                ]
                                \t{LValue}
                        ) | (?:
                                \t{LValue}
                                [
                                        \t{Operator::PlusPlus}
                                        \t{Operator::MinusMinus}
                                ]
                                (?!\t{Value})
                        )
                >, qr<
                        \t{Value}
                        \t{Operator::StarStar}
                        \t{Value}
                >r, qr<
                        $unop
                        [
                                \t{Operator::Exclamation}
                                \t{Operator::Tilde}
                                \t{Operator::Backslash}
                                \t{Operator::Plus}
                                \t{Operator::Minus}
                        ]
                        \t{Value}
                >, qr<
                        \t{Value}
                        [
                                \t{Operator::EqualsTilde}
                                \t{Operator::ExclamationTilde}
                        ]
                        \t{Value}
                >, qr<
                        \t{Value}
                        [
                                \t{Operator::Star}
                                \t{Operator::Slash}
                                \t{Operator::Percent}
                                \t{Operator::X}
                        ]
                        \t{Value}
                >, qr<
                        \t{Value}
                        [
                                \t{Operator::Plus}
                                \t{Operator::Minus}
                                \t{Operator::Underscore}
                        ]
                        \t{Value}
                >, ...
        );

        ($top)=map { @toke =~ s/$_/new Perl6::Toke::Value($&)/e } @rules;

For those who can't see what that is (and I don't blame you if you
can't) it's precedence levels 3-8 for Perl 5.  In one statement, @toke
would be turned into a syntax tree.  That tree would then undergo
further analysis to turn all the Operator::Foos into things like Add or
Numify, based on each node's siblings.
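
To make the idea concrete, here is a small Python sketch using just two
binary precedence levels.  It is not a translation of the statement
above: I am assuming each rule is applied repeatedly until it stops
matching, tightest level first, and that the operators group left to
right.  RULES, reduce_level, and build_tree are made-up names.

```python
class Toke: pass
class Operator(Toke): pass
class Star(Operator): pass
class Slash(Operator): pass
class Plus(Operator): pass
class Minus(Operator): pass


class Value(Toke):
    def __init__(self, name=None, children=()):
        self.name = name
        self.children = list(children)


# Tightest-binding level first, like the ordering of @rules.
RULES = [(Star, Slash), (Plus, Minus)]


def reduce_level(tokens, ops):
    """Collapse every Value/op/Value triple for one precedence level."""
    tokens = list(tokens)
    i = 0
    while i + 2 < len(tokens):
        a, op, b = tokens[i:i + 3]
        if isinstance(a, Value) and isinstance(op, ops) and isinstance(b, Value):
            tokens[i:i + 3] = [Value(children=[a, op, b])]
            # Stay at i so a chain such as a - b - c groups left to right.
        else:
            i += 1
    return tokens


def build_tree(tokens, rules=RULES):
    for ops in rules:
        tokens = reduce_level(tokens, ops)
    return tokens


a, b, c = Value("a"), Value("b"), Value("c")
tree = build_tree([a, Plus(), b, Star(), c])   # a + b * c
```

After the call, tree holds a single Value whose children are a, the
Plus, and a nested Value for b * c.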

Implementation-wise, things get fairly easy once the vtable->match stuff
I mentioned above is in place.  Obviously we would optimize for string
matching.  Also, we WILL lose some performance, on top of the current
slowdown relative to Perl 5.  Nevertheless, it can be done.

--Brent Dax
[EMAIL PROTECTED]
Parrot Configure pumpking and regex hacker
Check out the Parrot FAQ: http://www.panix.com/~ziggy/parrot.html (no,
it's not mine)

<obra> mmmm. hawt sysadmin chx0rs
<lathos> This is sad. I know of *a* hawt sysamin chx0r.
<obra> I know more than a few.
<lathos> obra: There are two? Are you sure it's not the same one?
