Nicholas Clark wrote:
> On Sun, Apr 16, 2006 at 11:22:40AM -0700, Patrick R. Michaud wrote:
>>     $S1 = "He said, \xabHello\xbb"
>>     $S2 = "3 \u2212 4 = \u207b 1"
>>
>> are treated as ASCII strings even though they obviously contain
>> codepoints outside of the ASCII range.  (The first results in a
>> 'malformed string' error when compiled, the second chops off the
>> high-order bits of the \u sequence.)
>
> IIRC having ASCII as the default was a deliberate design choice to avoid
> the confusion of "is it iso-8859-1 or is it utf-8" when encountering a
> string literal with bytes outside the range 0-127.

Aye, it was auto-promoting to latin1 and was changed to ascii-by-default
by me and Leo a while ago.

> If so, then I assume that the behaviour of your second example is wrong - it
> should also be a malformed string.
>
> If PGE is always outputting UTF-8 literals, what stops it from always
> prefixing every literal "unicode:", even if it only uses Unicode characters
> 0 to 127?

Indeed, it would be much easier if unicode:"" on an ascii-only string
could automatically go back to using ascii for its representation, and
choose utf8 (or better, latin1/ucs2) only iff there are high-bit
characters in it.

A Perlish way to solve this is to introduce another pragma, similar to
"n_operators", that controls the encoding of all string literals of the
PIR program:

    .pragma encoding utf8

Once written that way, you can simply use literal « » in the program,
which reads better than \xab and \xbb anyway... :-)

Audrey
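
The fallback suggested above (keep ascii when a unicode:"" literal has no high-bit characters, and widen only when needed) can be sketched roughly as follows. This is an illustrative Python sketch, not Parrot code: the function name narrowest_encoding and the returned labels are hypothetical, and the exact thresholds are an assumption about how such a rule might be applied.

```python
def narrowest_encoding(text: str) -> str:
    """Return a label for the narrowest representation that can hold text.

    Hypothetical helper: the labels and cut-offs are assumptions chosen to
    mirror the ascii / latin1 / ucs2 / utf8 choices named in the thread.
    """
    if not text:
        return "ascii"            # an empty literal needs nothing wider
    highest = max(ord(ch) for ch in text)
    if highest < 0x80:
        return "ascii"            # every codepoint is in the range 0-127
    if highest < 0x100:
        return "iso-8859-1"       # fits in one byte per character (latin1)
    if highest < 0x10000:
        return "ucs2"             # fits in one 16-bit unit per character
    return "utf8"                 # beyond the BMP: use a variable-width form


if __name__ == "__main__":
    for literal in ("Hello", "He said, \xabHello\xbb", "3 \u2212 4 = \u207b 1"):
        print(repr(literal), "->", narrowest_encoding(literal))
```

With a rule like this, a generator such as PGE could prefix every literal with "unicode:" and still get ascii storage for plain ASCII text.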