Re: [ID 20020130.001] Unicode broken for 0x10FFFF

Larry Wall Wed, 30 Jan 2002 10:33:23 -0800

Jarkko Hietaniemi writes:
: > What I notice, though, is that the current code does not warn for
: > characters beyond 0x10FFFF, which is definitely a bug.
: 
: Ahh, it's all coming back now... warning about such characters
: causes pain in the complementing tr///... have to look at this later.


I think the general policy of Perl should be that it is allowed to
think about bad thoughts, because that is the only way to understand
what's bad about the bad thoughts Perl receives on input.  If there is
to be any self-censorship, it should be on the output, I believe.
That's why they're called "disciplines", after all. :-) So it's fine if
the default output discipline enforces that the internal representation
is transformed to well-formed UTF-8.  It's even okay if the default
input discipline enforces well-formedness, as long as there's a way
to get at the raw badness.
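
As an illustration of that split, here is a minimal sketch in Python (not Perl, and not how any Perl discipline is implemented). The names `is_well_formed`, `raw_input` and `strict_output` are hypothetical. The internal form is just a list of integers; only the output side refuses anything that is not a Unicode scalar value.

```python
MAX_UNICODE = 0x10FFFF  # highest code point Unicode defines


def is_well_formed(cp):
    """True if cp is in the Unicode range and is not a surrogate."""
    return 0 <= cp <= MAX_UNICODE and not (0xD800 <= cp <= 0xDFFF)


def raw_input(code_points):
    """Permissive input: keep every integer, including the bad ones."""
    return list(code_points)


def strict_output(code_points):
    """Strict output: encode to UTF-8, rejecting ill-formed values."""
    bad = [cp for cp in code_points if not is_well_formed(cp)]
    if bad:
        raise ValueError("cannot emit as UTF-8: " + ", ".join(hex(cp) for cp in bad))
    return "".join(chr(cp) for cp in code_points).encode("utf-8")
```

With this arrangement, `raw_input([0x48, 0x110000])` succeeds and the value can be inspected internally, while `strict_output` on the same list raises `ValueError`.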

But within Perl, character strings are simply sequences of integers.
The internal representation must be optimized for this concept, not for
any particular Unicode representation, whether UTF-8 or UTF-16 or
UTF-32.  Any of these could be used as underlying representations, but
the abstraction of sequences of integers must be there explicitly in
the internal high-level string API.  To oversimplify, the high-level
API must not have any parameters whose type contains the string "UTF".

In the absence of other type information, these integers are assumed
to be Unicode code points.  Additional strictures are possible and even
useful, but should not be the default (except for certain operations that
are explicitly designed for Unicode).

For various reasons, some of which relate to the sequence-of-integer
abstraction, and some of which relate to "infinite" strings and arrays,
I think Perl 6 strings are likely to be represented by a list of
chunks, where each chunk is a sequence of integers of the same size or
representation, but different chunks can have different integer sizes
or representations.  The abstract string interface must hide this from
any module that wishes to work at the abstract string level.  In
particular, it must hide this from the regex engine, which works on
pure sequences in the abstract.
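
One way to picture the chunked layout is the following rough Python sketch. It is only an assumption about the shape of such a structure, not a Perl 6 design; the class name `ChunkedString` and its methods are hypothetical. Each chunk stores integers of one width, and callers see a single flat sequence.

```python
from array import array
from bisect import bisect_right


class ChunkedString:
    """A flat sequence of integers stored as chunks of differing widths."""

    def __init__(self, chunks):
        self._chunks = [c for c in chunks if len(c) > 0]
        self._starts = []  # abstract index of the first item in each chunk
        total = 0
        for chunk in self._chunks:
            self._starts.append(total)
            total += len(chunk)
        self._length = total

    def __len__(self):
        return self._length

    def __getitem__(self, i):
        if i < 0:
            i += self._length
        if not 0 <= i < self._length:
            raise IndexError(i)
        k = bisect_right(self._starts, i) - 1  # chunk that holds index i
        return self._chunks[k][i - self._starts[k]]

    def __iter__(self):
        for chunk in self._chunks:
            yield from chunk


# A byte-wide chunk followed by a chunk of wider integers.
s = ChunkedString([array("B", b"abc"), array("L", [0x263A, 0x110000])])
```

Code that indexes or iterates `s` never learns where one chunk ends and the next begins, which is the property the abstract string interface has to guarantee.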

Note that I did not use the phrase "pure sequences of integers" in the
last sentence.  The regex engine must not care if it is matching
characters from a string of known length, or token objects from an
array that is being grown arbitrarily on demand.  Matching on UTF-32
is not good enough.
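
To make that concrete, here is a toy backtracking matcher in Python that works on any sequence grown on demand. It is a sketch under assumptions, not the Perl regex engine: `LazySeq` and `match_here` are hypothetical names, and a pattern is simply a list of `(predicate, starred)` pairs. Items may be integers, token tuples, or anything else the predicates understand.

```python
class LazySeq:
    """Buffers items pulled from an iterator only when they are asked for."""

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._buf = []

    def has(self, i):
        """True if item i exists, pulling from the source as needed."""
        while len(self._buf) <= i:
            try:
                self._buf.append(next(self._it))
            except StopIteration:
                return False
        return True

    def get(self, i):
        if not self.has(i):
            raise IndexError(i)
        return self._buf[i]

    @property
    def consumed(self):
        return len(self._buf)


def match_here(pattern, seq, pos=0):
    """Return the end position of a match starting at pos, or None."""
    if not pattern:
        return pos
    (pred, starred), rest = pattern[0], pattern[1:]
    if starred:
        # Greedy: take as many items as the predicate allows, then back off
        # one at a time. This never returns if the predicate accepts every
        # item of an endless source.
        end = pos
        while seq.has(end) and pred(seq.get(end)):
            end += 1
        while end >= pos:
            result = match_here(rest, seq, end)
            if result is not None:
                return result
            end -= 1
        return None
    if seq.has(pos) and pred(seq.get(pos)):
        return match_here(rest, seq, pos + 1)
    return None
```

The matcher only ever calls `has` and `get`, so it cannot tell a fixed-length string from a token stream that is still being produced.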

This is just a heads up for some of the stuff in Apocalypse 5.
Backtracking behavior will not necessarily be limited to regexes in
Perl 6, and if so, we have to consider very carefully how regex
backtracking, continuations, and temp variable unifications all work
together.  (This is part of the reason I pushed earlier for the regex
opcodes to be meshed with the normal opcodes.)

I seriously intend that it be trivial to write a Perl parser (or any
other parser) in Perl, and that changing a grammar rule be as simple as
swapping in a different qr// (or a sub equivalent to a qr//).  More
generally, I want logic programming to be one of the paradigms that
Perl supports.  And as usual, I want to support it without forcing it
on people who aren't interested.

Sorry I can't be more clear yet.  Story of my life.  That's the basic
problem with the bear-of-very-little-brain approach.  So please "bear"
with me.

[I've cross-posted because of the wide interest, but I don't want to
start a general frenzy cross-posted to all the lists.  Please answer
specific points in separate messages, and please direct each followup
to the appropriate list.  Thanks.]

Larry
