Jarkko Hietaniemi writes:
: > What I notice, though, is that the current code does not warn for
: > characters beyond 0x10FFFF, which is definitely a bug.
:
: Ahh, it's all coming back now... warning about such characters
: causes pain in the complementing tr///... have to look at this later.

I think the general policy of Perl should be that it is allowed to think about bad thoughts, because that is the only way to understand what's bad about the bad thoughts Perl receives on input. If there is to be any self-censorship, it should be on the output, I believe. That's why they're called "disciplines", after all. :-)

So it's fine if the default output discipline enforces that the internal representation is transformed to well-formed UTF-8. It's even okay if the default input discipline enforces well-formedness, as long as there's a way to get at the raw badness.

But within Perl, character strings are simply sequences of integers. The internal representation must be optimized for this concept, not for any particular Unicode representation, whether UTF-8 or UTF-16 or UTF-32. Any of these could be used as underlying representations, but the abstraction of sequences of integers must be there explicitly in the internal high-level string API. To oversimplify, the high-level API must not have any parameters whose type contains the string "UTF".

In the absence of other type information, these integers are assumed to be Unicode code points. Additional strictures are possible and even useful, but should not be the default (except for certain operations that are explicitly designed for Unicode.)

For various reasons, some of which relate to the sequence-of-integer abstraction, and some of which relate to "infinite" strings and arrays, I think Perl 6 strings are likely to be represented by a list of chunks, where each chunk is a sequence of integers of the same size or representation, but different chunks can have different integer sizes or representations. The abstract string interface must hide this from any module that wishes to work at the abstract string level. In particular, it must hide this from the regex engine, which works on pure sequences in the abstract. Note that I did not use the phrase "pure sequences of integers" in the last sentence.
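
As a rough illustration of the chunked representation described above, here is a minimal sketch in Python. It is not a proposed design: the names ChunkedString, from_code_points, chunk_typecodes, and find are all hypothetical, and the fixed chunk_size is an arbitrary assumption. Each chunk is a packed array of integers of one width, the wrapper exposes only length, indexing, and iteration, and the search routine uses nothing but that abstract interface.

```python
from array import array
from bisect import bisect_right
from itertools import accumulate


class ChunkedString:
    """Hypothetical sketch: a string stored as a list of integer chunks.

    Each chunk is a packed array of one integer width; different chunks
    may use different widths. Callers see only a flat sequence of integers.
    """

    def __init__(self, chunks):
        self._chunks = [c for c in chunks if len(c) > 0]
        # _ends[i] is the flat index one past the last element of chunk i.
        self._ends = list(accumulate(len(c) for c in self._chunks))

    @classmethod
    def from_code_points(cls, code_points, chunk_size=4):
        """Split into fixed-size chunks, each using the narrowest width that fits."""
        chunks = []
        for start in range(0, len(code_points), chunk_size):
            piece = code_points[start:start + chunk_size]
            biggest = max(piece)
            # 'B' is 1 byte, 'H' at least 2 bytes, 'L' at least 4 bytes.
            typecode = "B" if biggest < 0x100 else "H" if biggest < 0x10000 else "L"
            chunks.append(array(typecode, piece))
        return cls(chunks)

    def __len__(self):
        return self._ends[-1] if self._ends else 0

    def __getitem__(self, index):
        if not isinstance(index, int):
            raise TypeError("this sketch supports integer indices only")
        if index < 0:
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError(index)
        # Locate the chunk containing the flat index, then index within it.
        chunk_no = bisect_right(self._ends, index)
        chunk_start = self._ends[chunk_no - 1] if chunk_no else 0
        return self._chunks[chunk_no][index - chunk_start]

    def __iter__(self):
        for chunk in self._chunks:
            yield from chunk

    def chunk_typecodes(self):
        """Expose the per-chunk storage types, for inspection only."""
        return [c.typecode for c in self._chunks]


def find(haystack, needle):
    """Naive search that relies only on len() and integer indexing."""
    n, m = len(haystack), len(needle)
    for start in range(n - m + 1):
        if all(haystack[start + k] == needle[k] for k in range(m)):
            return start
    return -1


# The last value lies beyond 0x10FFFF; the representation stores it without complaint.
s = ChunkedString.from_code_points(
    [97, 98, 99, 0x263A, 0x1F600, 0x110000], chunk_size=2
)
print(s.chunk_typecodes())
print(find(s, [99, 0x263A]))
```

Note that find never learns how the integers are stored, and its match happens to straddle a chunk boundary. Nothing in its signature mentions an encoding, which is the property the high-level API is meant to have.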
The regex engine must not care if it is matching characters from a string of known length, or token objects from an array that is being grown arbitrarily on demand. Matching on UTF-32 is not good enough.

This is just a heads up for some of the stuff in Apocalypse 5. Backtracking behavior will not necessarily be limited to regexes in Perl 6, and if so, we have to consider very carefully how regex backtracking, continuations, and temp variable unifications all work together. (This is part of the reason I pushed earlier for the regex opcodes to be meshed with the normal opcodes.)

I seriously intend that it be trivial to write a Perl parser (or any other parser) in Perl, and that changing a grammar rule be as simple as swapping in a different qr// (or a sub equivalent to a qr//).

More generally, I want logic programming to be one of the paradigms that Perl supports. And as usual, I want to support it without forcing it on people who aren't interested.

Sorry I can't be more clear yet. Story of my life. That's the basic problem with the bear-of-very-little-brain approach. So please "bear" with me.

[I've cross-posted because of the wide interest, but I don't want to start a general frenzy cross-posted to all the lists. Please answer specific points in separate messages, and please direct each followup to the appropriate list. Thanks.]

Larry